On Matrix's Release Cadence & State Resolution v1

3rd Jul 2023

For those of you in the United States, tomorrow is Independence Day! 🎆

I've had a long weekend because of the holiday, which is why you get two newsletters so close together, enjoy!

On Matrix's Release Cadence

In last week's TWIM, the following was announced:

Room version 11 has been accepted! A gentle reminder for Matrix client and homeserver implementations to update and incorporate the new changes. The full list can be found in the MSC itself (mainly changes to redaction semantics and related fields).

While this means that room version 11 is now considered stable, it's not recommended that homeserver implementations set their default room version for newly created rooms to 11 yet. We recommend waiting a few months for client and other homeserver implementations to gain support first.

It appears to me that the authors of the Matrix Specification ("spec") expect developers to keep up with the MSCs and implement them as soon as they are accepted, but before the next version of the official spec is released. I don't really agree with this way of doing things, so I just want to clarify how I expect Telodendria to deal with new spec versions going forward.

It is my understanding that new versions of the spec are released every quarter. This is far too frequent, in my opinion, because I think spec stability is really important. The C standard, for example, updates roughly once a decade. I would be much happier if the spec updated at roughly this pace. The reason for this is that I want time to actually implement the spec, test that implementation, and get it in the hands of users so it can be stable. If I'm constantly adding new features to Telodendria because of new spec versions, I don't know how Telodendria will ever be stable.

My goal is to be as confident in Telodendria as I am in my SMTP server, for example. With mail servers, once you set them up, you never think twice about them because they always just do what they're supposed to. I want Telodendria to be the same way. I want it to always just work. And I don't want it to be updating constantly. I don't want a version that's only 6 months old to be out of date. If Matrix is to stand the test of time, the spec authors can't overwhelm developers with new features constantly. End users don't necessarily care about getting new features all the time, they just want something that is reliable and won't change on them in unexpected ways.

When I created the Telodendria CVS repository, the spec was at v1.3. It is now at v1.7 about a year later. For a new homeserver developer like myself, this is incredibly overwhelming! I can't even fully wrap my mind around the current spec version—let alone implement it—before another version comes out with more API endpoints and features. I have therefore determined the following regarding Telodendria's relationship with new specification versions:

Telodendria will not implement any MSC before it arrives in an officially published spec at https://spec.matrix.org. There will be no experimental or unofficial changes, and when an MSC is finalized and marked as accepted, it must be formally added to the spec. Therefore, I won't even consider the existence of Room Version 11, for example, until it shows up in spec version v1.8 at the earliest. The reason for this is because I want Telodendria to be stable. Yes, that means it's always going to be behind on the latest features, but that is a good thing the way I see it because it gives Telodendria the the chance to be reliable. I shouldn't have to keep up with the spec authors and follow their activity so I know what's happening with the spec; I should be able to implement the stable spec and refine my implementation until the next spec version is released.
Telodendria's first stable release will implement spec version v1.7. In the time between now and the release date for Telodendria's first stable release, I expect there to be many more releases of the Matrix specification, but I am freezing Telodendria's view of the spec until I've implemented everything in v.1.7. I have to draw the line somewhere because otherwise I'll never get a stable release of Telodendria out, and I don't think anyone wants that. I am therefore drawing the line here; Telodendria's first stable release, even if it comes 5 years from now, will implement the version of the spec that's out right now. I'm pretty sure it won't be 5 years though, but you get my point.
Telodendria's versioning scheme when it becomes stable will indicate which spec version is being implemented. This means that the first stable release will carry the version number v1.7.0. Expect point releases as I fix bugs, but don't expect v1.8.0 until spec v1.8 is fully implemented.
Telodendria will not skip spec versions. It's highly likely that multiple versions of the spec will be released in the time it takes me to implement v1.7. I'm not going to skip the in-between spec versions though; I'm going to implement each one in the order it came out simply because it would be too overwhelming to try to track all the changes across spec releases and combine them into one large Telodendria release. In other words, don't expect Telodendria v1.9.0 until v1.8.0 is released, implementing spec version v1.8. I'm hoping that the changes to the spec will be small enough between versions that I can comfortably implement them without getting too behind, but I know I'll probably be quite a bit behind for the first release.

What this really comes down to is that I don't buy into the rapid pace at which the spec is developed. Many core internet technologies—DNS, SMTP, etc, to name a few—have remained virtually unchanged since their introduction many decades ago, and I believe that in order to get Matrix to become a core internet technology, it has to stabilize. It has to, at some point, be finished, even if it is not perfect. That's the point of a standard; it is to be something that doesn't change, or changes very little, so it can be relied upon for many years to come. Maybe that's not the goal of the spec authors, I'm not sure. All I know is that a new spec release every quarter is simply unsustainable. We don't see four new versions of SMTP every single year, and that's because SMTP just works. Of course it is not perfect. It has its flaws, but nonetheless, people depend on it to just work and it does. My vision for Matrix is a similar situation. It doesn't have to be perfect. It probably never will be. It just has to work. And right now, it works, so I'd be hesitant to keep changing it.

Matrix Homeserver Implementation Part 2: State Resolution v1

In my previous newsletter, I gave an overview of the Matrix protocol and provided some helpful information for actually implementing the core functionality that homeservers adhering to the spec are expected to provide. I left off on State Resolution, which I felt deserved its own series of newsletters because covering it will take at least as many words as that newsletter ended up being.

This newsletter is going to have a lot of pseudocode. As I mentioned in my previous newsletter, Matrix state resolution is well-documented from a conceptual standpoint, so I don't feel the need to explain the concepts yet again. Instead, I'll focus on the far more task of translating those concepts into code. I should point out right now that my pseudocode is not Python-esque like most---rather, it resembles C, which I find easier to read and write. C is much less ambiguous. But maybe I think that just because I'm a C programmer and not a Python programmer. Regardless of your preferred language, I hope you can see past the syntax and understand what I'm writing. The data structures used can be easily inferred, although I may be biased because they very closely resemble Telodendria's actual code.

To recap briefly the important takeaways from Part 1, State is a collection of key-value pairs where the key is an event type and a state key provided by an event, and the value is an event ID. This collection of key-value pairs describes a room in its entirety; from name, topic, and aliases to user membership and permissions, everything anyone needs to know about a room is derived from the room's State.

You can think of room state as a collection of pointers to the state events that determine what the room looks like, who's in it, and what kind of events they're allowed to send. We know why state is so important, and we know where state resolution fits into the overall implementation of a homeserver, but we have yet to figure out how it actually works. As I mentioned in my last newsletter, there are two algorithms, with room version 1 using the original algorithm, and versions 2 and beyond using the new and improved algorithm. In this post, I'll set up the framework primarily for the original algorithm, but some of these details will apply also to the improved algorithm.

First, we need to figure out what the state resolution algorithm's inputs and outputs are. At a high level, we can say that the state resolution algorithm's input is a single event, and its output is the complete state of the Matrix room before that event was added to the room's directed acyclic graph (DAG). So, your state resolution function is going to look roughly like this:

    HashMap *
    StateResolve(Event *);

Note that both state resolution algorithms are defined recursively; that is, you must have resolved the state for each previous event specified by the input event before you can resolve the state for the current event. This might look like this:

    HashMap *
    StateResolve(Event *e)
    {
        Array *states;
        size_t i;

        for (i = 0; i < ArraySize(e->prev_events); i++)
        {
            Event *p = ArrayGet(e->prev_events, i);
            HashMap *state = StateResolve(p);
            /*
             * state is now the room state before p. Remember to
             * apply p to state if p is a state event so that
             * state becomes the room state *after* p.
             */
            ArrayAdd(states, state);
        }

        /*
         * states is now the set of the previous states of the
         * of the room to be resolved. We can continue with the
         * resolution algorithm.
         */
    }

An efficient homeserver will not resolve the room state at an event more than once; room state should be cached for each event that it is calculated for so that the algorithm does not need to compute redundant results. The implication of this is that a Matrix homeserver stores snapshot of the room state before each and every event in the room.

The next step in state resolution now depends on the algorithm we are using. If the room version is 1, then we have to use the original state resolution algorithm. Otherwise, we use the new, improved state resolution algorithm. We should modify the signature of our StateResolve() function to accommodate this:

    HashMap *
    StateResolve(int roomVersion, Event *e);

To make things easier, after we perform the logic shown above and we have our set of states, lets break out each version of the algorithm into its own function:

    HashMap *
    StateResolve(int roomVersion, Event *e)
    {
        Array *states; /* Array of HashMaps */

        /* ... Logic shown above ... */

        switch (roomVersion)
        {
            case 1:
                return StateResolveV1(states);
                break;
            default:
                return StateResolveV2(states);
        }
    }

Let us now implement the original state resolution algorithm. The first step is as follows:

"Start by setting R to the union of the states to be resolved, excluding any conflicting events."

What this means is that we have to build up a new partial state map by looking at each state in the incoming array of states and adding all the keys that don't already exist in the partial state. If a key does have a conflict, then we remove it from the partial state and collect the conflicting events events to resolve them. This process looks something like this:

    HashMap *
    StateResolveV1(Array *states)
    {
        HashMap *r;
        HashMap *conflicted;
        size_t i;

        for (i = 0; i < ArraySize(states); i++)
        {
            HashMap *state = ArrayGet(states, i);
            char *eventType;
            HashMap *stateKeys;

            while (HashMapIterate(state, &eventType,  &stateKeys))
            {
                char *stateKey;
                char *eventId;

                while (HashMapIterate(stateKeys, &stateKey, &eventId))
               {
                    /* (eventType, stateKey) -> eventId */
                    if (HashMapGet(HashMapGet(r, eventType), stateKey))
                    {
                        /*
                         * Conflicted state, add to conflicted and
                         * remove from r.
                         */
                    }
                    else
                    {
                        /* Add this state pair to r */
                    }
                }
            }
        }
    }

We now have R, which perfectly maps a state tuple to a single event ID. However, we also have a conflicted set, which maps a state tuple to more than one event ID. This is what we have to resolve onto R. The original state resolution algorithm states that we must deal specially with the following types of events:

m.room.power_levels
m.room.join_rules
m.room.member

For each of these event types, we are to collect all the event IDs that have these types in the conflicted set into a list. We will then sort that list by depth and event ID---well, specifically the SHA-1 hash of the event ID. Then, we will iterate over each list, adding the first event, and then making sure the rest of the events are allowed by the authorization rules in R. If they are, we keep updating R with the events, otherwise, we stop and continue on to the next list. This process is explained a little more clearly in the Matrix specification, but here's some pseudocode that shows how it's done:

    HashMap *
   StateResolveV1(Array *states)
    {
        HashMap *r;
        HashMap *conflicted;

        Array *authTypes;
        size_t i;

        /* ... Logic shown above ... */

        ArrayAdd(authTypes, "m.room.power_levels");
        ArrayAdd(authTypes, "m.room.join_rules");
        ArrayAdd(authTypes, "m.room.member");

        for (i = 0; i < ArraySize(authTypes); i++)
        {
            char *authType = ArrayGet(authTypes, i);
            HashMap *stateKeys = HashMapGet(conflicted, authType);

            Array *events;
            char *stateKey;
            Array *eventIds;

            char *eventId;
            Event *e;

            /* Assemble all the events of the given type */

            while (HashMapIterate(stateKeys, &stateKey, &eventIds))
            {
                size_t j;

                for (j = 0; j < ArraySize(eventIds); j++)
                {
                    eventId = ArrayGet(eventIds, i);

                    ArrayAdd(events, eventId);
                }
            }

            /* Sort ascending by depth, descending by SHA-1 of ID */

            ArraySort(events, EventCompareFunc);

            /* Add first event to R */

            eventId = ArrayGet(eventIds, 0);
            e = EventGet(eventId);
            HashMapSet(HashMapGet(r, e->type), e->state_key, eventId);
        }
    }

The EventCompareFunc() referenced in the above snippet is implemented roughly as follows:

    int
    EventCompareFunc(Event *e1, Event *e2)
    {
        int depth1 = e1->depth;
        int depth2 = e2->depth;

        if (depth1 > depth2)
        {
            return 1;
        }
        else if (depth1 < depth2)
        {
            return -1;
        }
        else
        {
            char *sha1 = Sha1(e1->event_id);
            char *sha2 = Sha1(e2->event_id);

            /* Descending */
            return strcmp(sha1, sha2) * -1;
        }
    }

The above code actually covers the next two steps in the algorithm as well; it sorts the event lists and it adds the first event in the list to R. The next step is to process the rest of the events in the list, which is as simple as iterating over it and performing the auth check using the partial state R.

We should then clean up the conflicted set, removing the type keys that we've processed so far, because we will have to iterate over all the remaining type keys.

But first, suppose we have this function:

    int
    AuthCheck(HashMap *state, Event *e);

This function takes in a room state and an event and determines whether or not that event is allowed to be in the room based on the authorization rules determined by the room state, such as membership, power levels, and the like. This is going to be the same function used when receiving events from the client-server API or the federation API. We also need it for state resolution. AuthCheck() should follow the authorization rules for the room version we are working with, so in practice, the function should also take a room version. Refer to the relevant section of the specification. It doesn't make sense to lay out an implementation here because the logic is very clearly laid out in the specification.

Assuming we have such a function, we can continue on with our algorithm. After we add the first event from our sorted list to R, we can use the AuthCheck() function on the remainder of the events, like this:

    HashMap *
    StateResolveV1(Array *states)
    {
        HashMap *r;
        HashMap *conflicted;
        size_t i;

        /* ... Logic shown above ... */

        for (i = 0; i < ArraySize(authTypes); i++)
        {
            char *authType = ArrayGet(authTypes, i);
            size_t j;
            Array *events;

            /* ... Logic shown above ... */

            for (j = 1; j < ArraySize(events); j++)
            {
                char *eventId = ArrayGet(events, j);
                Event *e = EventGet(eventId);

                if (AuthCheck(r, e))
                {
                    HashMapSet(HashMapGet(r, e->type), e->state_key, eventId);
                }
                else
                {
                    break;
                }
            }

            HashMapDelete(conflicted, authType);
        }
    }

Finally, one step remains in the algorithm:

"... for all other conflicts, just pick the event with the highest depth and the lowest sha1(event_id) that passes authentication in R and add it to R."

All that the specification is telling us to do here is to iterate over the remaining conflicted events, sorting them like we sorted all the other events, and then work backwards through the list until we find an event that passes authentication against R. That looks like this:

    HashMap *
    StateResolveV1(Array *states)
    {
        HashMap *r;
        HashMap *conflicted;

        size_t i;
        char *eventType;
        HashMap *stateKeys;

        /* ... Logic shown above ... */

        while (HashMapIterate(conflicted, &eventType, &stateKeys))
        {
            char *stateKey;
            Array *events;

            while (HashMapIterate(stateKeys, &stateKey, &events))
            {
                ArraySort(events, EventCompareFunc);

                for (i = ArraySize(events) - 1; i >= 0; i--)
                {
                    char *eventId = ArrayGet(events, i);
                    Event *e = EventGet(eventId);
                    if (AuthCheck(r, e))
                    {
                        HashMapSet(HashMapGet(r, eventType), stateKey, eventId);
                        break;
                    }
                }
            }
        }
    }

And that should be it! This is the basic algorithm for state resolution v1. While some details have been omitted—notably the auth check and the details of some of the primitive data structures—this should fundamentally work. I haven't tested any of this code, and I know for certain it is missing a lot of error checking, and there are probably a few things that don't exactly work as shown, but other than that, this should get the idea across, and I would expect that something similar to this will end up in the code base.

What's Next?

I'm getting kind of anxious to stop theorizing so much and actually start writing code again. Before I tackle state resolution v2, I'll probably spend a bit of time implementing a lot of the details I've laid out here and in the last newsletter. I have to figure out some of the abstractions that need to be put in place so that the nuances of each room version, including the state resolution algorithm and the auth checks, can be isolated without duplicating a bunch of code but also not creating an unreadable mess. I also just want to get things working, so for the time being I will probably just focus on implementing room version 1, and then once I know how that comes out, I'll have a better idea of the abstractions needed to make other room version implementations elegant.

Previous Post Next Post

On Matrix's Release Cadence & State Resolution v1

On Matrix's Release Cadence

Matrix Homeserver Implementation Part 2: State Resolution v1

What's Next?

Syndicate