Source Control Aware Queries

Since 2021.08.30

Git support has been added 🎊

Since 4.9.

The capability name associated with this enhanced functionality is scm-since.

The capability name for Mercurial support is scm-hg. The internal architecture makes it relatively easy to support other source control systems; it just needs someone to implement and test them!

A common pattern for tools that consume watchman is wanting to reason about the changes in a version controlled repository. For most repos it is fine to simply receive information about all changed files as they are updated, even when a rebase brings in several days of work by others.

For very large or very busy repositories, where a great many files can change over a short period of time, it can be desirable to get a minimized set of information about the changes.

For example, if your tool can load pre-built data from some artifact storage, then rather than processing many hundreds of changed files incrementally, you may want to take the merge base of the local changes, use it to locate the pre-built data, and process only the delta between that state and the current state of the repo.

An illustration may help. Here we see that a user has a stack of two commits based off the symbolic main commit. In this scenario, main is tracking the tip of the repo to which the local repo is published, and the user is checked out at the 6b38a5 commit:

| @  6b38a5  wez
| | Add cats.cpp
| |
| o fa2e92 wez
|/ Add cat.jpg
|
o f12345 main

Now the user synchronizes their repo with the remote, fetching the commits but not changing their work yet. This is often combined with the step that follows, but we are breaking it out here for the purposes of illustration. This is equivalent to running hg pull or git fetch:

o  fabf87  coworker     main
. Amazing new feature
.
| @ 6b38a5 wez
| | Add cats.cpp
| |
| o fa2e92 wez
|/ Add cat.jpg
|
o

The ellipsis portion of the DAG represents uninteresting commits to wez; there may be hundreds of files changed by those commits, but wez only cares about the work in their local branch of the DAG.

Now wez wants to rebase their work on main. This would be done using a command like hg rebase -d main -s fa2e92:

| @  bbbbbb  wez
| | Add cats.cpp
| |
| o aaaaaa wez
|/ Add cat.jpg
|
o fabf87 coworker main
. Amazing new feature
.

The crucial part of this is what happens to the working copy; assuming that we now land on commit bbbbbb, Watchman will observe changes for all of the hundreds of files that changed across the rebase and pass this information on to the tools that are subscribed or are querying for this information.

If your tooling is source control aware, then you can ask watchman to run since queries in a mode where it returns the merge base with main and the minimized set of files that changed.

To enable this mode you issue a query using a new fat clock as the since parameter for the query:

$ watchman -j <<-EOT
["query", "/path/to/root", {
"since": {
"scm": {
"mergebase-with": "main"
}
},
"expression": ["type", "f"],
"fields": ["name"]
}]
EOT

This particular since value starts with an unspecified clock value and requests that watchman run the query in source control aware mode, using the symbolic name main to compute the merge base for the commit graph.
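For clients that assemble the command programmatically, the request above can be sketched in Python. The helper name `build_scm_query` is hypothetical and not part of any Watchman client library; it only builds the JSON shown above, which could then be piped to `watchman -j`.

```python
import json


def build_scm_query(root, mergebase_with, fields=("name",)):
    """Assemble a source control aware query that has no prior clock.

    Hypothetical helper: it mirrors the JSON command shown above and
    does not talk to watchman itself.
    """
    return [
        "query",
        root,
        {
            # A fat clock with only "scm" set starts from an unspecified clock.
            "since": {"scm": {"mergebase-with": mergebase_with}},
            "expression": ["type", "f"],
            "fields": list(fields),
        },
    ]


command = build_scm_query("/path/to/root", "main")
print(json.dumps(command, indent=2))
```

Keeping the symbolic name as a parameter lets the same helper serve repos whose published branch is not called main.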

If we look back to the illustrations above and rewind to the first scenario, the results of this query will look something like this:

{
"clock": {
"clock": "c:123:123",
"scm": {
"mergebase": "f12345",
"mergebase-with": "main"
}
},
"files": ["cat.jpg", "cats.cpp"]
}

This result informs the client of the merge base with main (which happens to be main itself) and the list of changes since that merge base.

To get the next incremental change, the client feeds that clock value back in to its next query. Looking back to the second illustration above, if we were to run this query after running hg pull (note that this doesn't change the working copy):

$ watchman -j <<-EOT
["query", "/path/to/root", {
"since": {
"clock": "c:123:123",
"scm": {
"mergebase": "f12345",
"mergebase-with": "main"
}
},
"expression": ["type", "f"],
"fields": ["name"]
}]
EOT

we'd get this result:

{
"clock": {
"clock": "c:123:124",
"scm": {
"mergebase": "f12345",
"mergebase-with": "main"
}
},
"files": []
}

Note that the files list is empty because we didn't change any files, and note that one of the numeric portions of the clock string has changed.

Also note that the mergebase revision remains the same because we also didn't rebase the commit yet.

This is a little white lie: the reality is that some files did change in the version control system, and with the expression we're using we would see them, but they are not part of the working copy so we're omitting them for the clarity of this example.
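The feedback step described above, where the fat clock from one response becomes the since value of the next query, can be sketched as follows. The function name `next_query` is hypothetical; the response dictionary is the example result from the first query.

```python
def next_query(previous_response, root):
    """Build the follow-up query from the fat clock of a prior response.

    Hypothetical helper: the fat clock is passed back unchanged, with
    both its "clock" string and its "scm" section.
    """
    return [
        "query",
        root,
        {
            "since": previous_response["clock"],
            "expression": ["type", "f"],
            "fields": ["name"],
        },
    ]


# The example result of the initial query.
first_response = {
    "clock": {
        "clock": "c:123:123",
        "scm": {"mergebase": "f12345", "mergebase-with": "main"},
    },
    "files": ["cat.jpg", "cats.cpp"],
}

followup = next_query(first_response, "/path/to/root")
```

The client treats the fat clock as an opaque value here: it stores whatever the last response returned and hands it back without editing it.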

Now if we rebase and update to the rebased revision (taking us to the last of the illustrations from above), we'd run this query, feeding in the clock from the last query to get the correct incremental result:

$ watchman -j <<-EOT
["query", "/path/to/root", {
"since": {
"clock": "c:123:124",
"scm": {
"mergebase": "f12345",
"mergebase-with": "main"
}
},
"expression": ["type", "f"],
"fields": ["name"]
}]
EOT

we'd get this result:

{
"clock": {
"clock": "c:123:125",
"scm": {
"mergebase": "fabf87",
"mergebase-with": "main"
}
},
"files": ["cat.jpg", "cats.cpp"]
}

Note that the mergebase reported in the clock has changed and note that the list of files reported is just the two from our commit stack despite there being hundreds of files that were physically updated on the disk.

Your client can now look up some state based on the fabf87 revision and download it, and can then incrementally apply the computation for cat.jpg and cats.cpp on top of that state.

If your client doesn't know how to do this, then you shouldn't use this source control aware query mode!
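One way a client might decide between the two paths is sketched below. This is an illustration under assumptions rather than a prescribed design: the function name `plan_update` and the action labels "reload" and "incremental" are hypothetical, and fetching the pre-built state is left to the caller.

```python
def plan_update(previous_clock, response):
    """Decide how to apply a source control aware query result.

    Returns (action, mergebase, files). The action is "reload" when the
    merge base differs from the one in the clock that was sent, meaning
    pre-built state for the new merge base should be loaded before the
    listed files are applied; otherwise it is "incremental".
    """
    new_mergebase = response["clock"]["scm"]["mergebase"]
    old_mergebase = previous_clock.get("scm", {}).get("mergebase")
    action = "reload" if new_mergebase != old_mergebase else "incremental"
    return action, new_mergebase, response["files"]


# The clock sent with the third query and the result it produced.
sent_clock = {
    "clock": "c:123:124",
    "scm": {"mergebase": "f12345", "mergebase-with": "main"},
}
result = {
    "clock": {
        "clock": "c:123:125",
        "scm": {"mergebase": "fabf87", "mergebase-with": "main"},
    },
    "files": ["cat.jpg", "cats.cpp"],
}

plan = plan_update(sent_clock, result)
```

For the rebase example, `plan` is `("reload", "fabf87", ["cat.jpg", "cats.cpp"])`. An initial query sent without a merge base also yields "reload", which matches loading pre-built data on first start.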

Source Control Aware Subscriptions

You can also use the same source control awareness in your subscriptions. This is basically the same procedure as making queries above, but there are some preconditions and things to note:

  • Watchman needs the cooperation of the source control system to know when it should defer events.
  • Source control aware subscriptions implicitly enable defer_vcs and defer:["hg.update"]. As with the point above, this is to ensure that you don't get notified about files changing during the working copy update operation; that would defeat the point of using source control awareness.

To initiate a source control aware subscription:

["subscribe", "/path/to/root", "mysubscriptionname", {
"fields": ["name"],
"since": {
"scm": {
"mergebase-with": "main"
}
}
}]

You'll then receive subscription responses as files change; those responses will contain fat clock values for the since and clock fields:

{
"subscription": "mysubscriptionname",
"clock": {
"clock": "c:1234:125",
"scm": {
"mergebase": "fabf87",
"mergebase-with": "main"
}
},
"since": {
"clock": "c:1234:123",
"scm": {
"mergebase": "f12345",
"mergebase-with": "main"
}
},
"files": ["cat.jpg", "cats.cpp"],
"root": "/path/to/root"
}

The clock field holds the value of the clock and the merge base as of the subscription notification.

The since field holds the fat clock that was returned in the clock field from the prior subscription update. It is present as a convenience for you; you can compare the mergebase fields between the two to determine whether the merge base changed in this update. This is an important detail because more files in the working copy have been physically changed than are reflected in the files list; your tooling will need to do something appropriate to ensure that it computes a consistent and correct result.
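The comparison between the since and clock fields can be sketched like this. The function name `mergebase_changed` is hypothetical, and the defensive handling of missing or non-dictionary clocks is an assumption made for the sake of the sketch, not documented behavior.

```python
def mergebase_changed(notification):
    """Report whether the merge base moved within one subscription update."""

    def mergebase_of(clock):
        # A fat clock is a dict with an "scm" section; anything else
        # carries no merge base.
        if isinstance(clock, dict):
            return clock.get("scm", {}).get("mergebase")
        return None

    before = mergebase_of(notification.get("since"))
    after = mergebase_of(notification.get("clock"))
    return before != after


# The example subscription response from above.
notification = {
    "subscription": "mysubscriptionname",
    "clock": {
        "clock": "c:1234:125",
        "scm": {"mergebase": "fabf87", "mergebase-with": "main"},
    },
    "since": {
        "clock": "c:1234:123",
        "scm": {"mergebase": "f12345", "mergebase-with": "main"},
    },
    "files": ["cat.jpg", "cats.cpp"],
    "root": "/path/to/root",
}

changed = mergebase_changed(notification)
```

Here `changed` is true, so the files list alone does not describe everything that moved on disk, and the client should rebuild from state associated with the new merge base.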

state-enter & state-leave

Source control aware subscriptions will always include a fat clock in their responses; however, only the regular clock is provided in state-enter and state-leave notifications. This is because computing the source control information is a non-trivial operation and could increase latency.
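A client that handles both kinds of message may want to normalize the two clock shapes. The sketch below assumes that state-enter and state-leave notifications carry the regular clock as a plain string while file notifications carry a fat clock object; the helper name `split_clock` is hypothetical.

```python
def split_clock(clock):
    """Return (clock_string, scm_info) for either clock shape.

    scm_info is None for a regular clock string, because no source
    control information accompanies it.
    """
    if isinstance(clock, str):
        return clock, None
    return clock["clock"], clock.get("scm")


# A regular clock, in the shape assumed for state notifications.
state_clock = split_clock("c:1234:126")

# A fat clock, in the shape used by file change notifications.
fat_clock = split_clock(
    {
        "clock": "c:1234:125",
        "scm": {"mergebase": "fabf87", "mergebase-with": "main"},
    }
)
```

With this shape a handler can keep the last known merge base from file notifications and carry it across state notifications that lack one.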