Staying updated with aggregated data

Just got off a great call with @paul121 @gbathree @AmberS and others at Our Sci about strategies for keeping aggregated data up to date. I wanted to share some of the ideas and discussions that came up in that call with the wider community to elicit feedback and share ideas.

Our Sci is building a “Digital Coffee Shop”, which aggregates data from many farmOS instances (via the farmOS Aggregator) to create data benchmarks and visualizations. They periodically check each instance for new records and pull them into their own custom MongoDB cache, where they are formatted for use in the Coffee Shop.

One of the challenges of aggregating large amounts of data is keeping it updated as changes are made in all of the farmOS instances. And as this scales to more farms, the challenge scales with it.

Two farmOS feature ideas were proposed to help with this:

  1. farmOS could provide a dedicated API endpoint that lists entities that were added/updated/deleted recently, filterable by date. This would improve the efficiency with which external systems (like the Coffee Shop) are able to determine what they need to update on their end.
  2. farmOS could send webhook notifications to an endpoint in the external system to notify it immediately of changes.
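
To make idea (1) concrete, here is a rough sketch of how an external system might consume such a changes feed. The endpoint, its parameters, and the response shape are all hypothetical; nothing like this exists in farmOS yet.

```python
# Hypothetical response from a farmOS "changes feed" endpoint, e.g.
# GET /api/changes?since=2021-06-01T00:00:00Z (this endpoint does not
# exist; the shape is an assumption for illustration only).
def apply_changes(cache, changes):
    """Apply a list of change records to a local cache keyed by UUID."""
    for change in changes:
        uuid = change["uuid"]
        if change["op"] == "deleted":
            cache.pop(uuid, None)  # forget deleted entities
        else:  # "created" or "updated"
            cache[uuid] = change["entity"]
    return cache

cache = {"a1": {"name": "old"}}
changes = [
    {"uuid": "a1", "op": "updated", "entity": {"name": "new"}},
    {"uuid": "b2", "op": "created", "entity": {"name": "seed"}},
    {"uuid": "c3", "op": "deleted"},
]
apply_changes(cache, changes)
```

The key point is that deletions appear as first-class events, which plain created/changed timestamps cannot express.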

(1) reminds me of some earlier discussions we had around maintaining an “audit log” of changes to records in farmOS. With farmOS v2, we now have “revision logs”, which maybe cover 80% of these needs, but they miss some things. For example, there is no way to see that records have been deleted.

We talked about different approaches to this. A simple approach might be to just query the log and asset (and other) tables and look at created and changed timestamps. Another would be to add a new dedicated database table that tracks everything in one place (high-level audit trails, not the specifics that are already captured in revisions). This could also capture deletions, and be queryable by Views in a single SQL query (and also exposed as an API endpoint via Views pretty easily).
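
To make the dedicated-table idea concrete, here is a minimal sketch using SQLite for illustration (farmOS itself uses Drupal's database layer; the table name and columns here are assumptions, not an actual schema proposal):

```python
import sqlite3

# Minimal sketch of a dedicated audit table: one row per high-level
# event, including deletions, queryable with a single SQL statement.
# Table and column names are hypothetical, not actual farmOS schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity_audit (
        entity_type TEXT NOT NULL,   -- e.g. 'asset', 'log'
        bundle      TEXT NOT NULL,   -- e.g. 'seed', 'harvest'
        uuid        TEXT NOT NULL,
        operation   TEXT NOT NULL,   -- 'insert', 'update', 'delete'
        changed     INTEGER NOT NULL -- unix timestamp
    )
""")
conn.executemany(
    "INSERT INTO entity_audit VALUES (?, ?, ?, ?, ?)",
    [
        ("asset", "seed", "a1", "insert", 1000),
        ("asset", "seed", "a1", "update", 2000),
        ("log", "harvest", "b2", "delete", 3000),
    ],
)
# "What changed since timestamp 1500?" is a single query across all
# entity types, including deletions, which revision logs alone miss.
rows = conn.execute(
    "SELECT entity_type, uuid, operation FROM entity_audit "
    "WHERE changed > ? ORDER BY changed", (1500,)
).fetchall()
```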

(2) reminds me of recent work and discussions with @Farmer-Ed @paul121 around the “Notifications” module in farmOS. We recently took a first step towards generalizing that module out of the “Data Stream Notifications” module. Currently there is a single “notification type plugin” for “email notifications”. Perhaps “webhook notifications” could be another type.

@paul121 raised some very good questions about security and information disclosure considerations with webhooks, so we should think carefully about that. One potential approach would be to make the webhook notification config entities configurable, so you could specify what information is included in it. Some might just say “something updated”! Others might include UUID and entity type/bundle information. Still others might send the entire entity as JSON in the webhook request. @gbathree also made the point that farmOS users should be aware of everything that’s happening behind the scenes like this, so that there is visibility.
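
As a sketch, the three disclosure levels described above might produce payloads like these (the level names and field names are illustrative only, not a proposed format):

```python
# Illustrative payloads for configurable webhook verbosity levels.
# These shapes are assumptions, not an actual farmOS format.

def build_payload(entity, level):
    """Build a webhook payload at one of three disclosure levels."""
    if level == "ping":
        # Minimal: "something updated", no identifying details.
        return {"event": "updated"}
    if level == "reference":
        # Enough to fetch the record via the API, nothing more.
        return {
            "event": "updated",
            "entity_type": entity["type"],
            "bundle": entity["bundle"],
            "uuid": entity["uuid"],
        }
    # "full": the entire entity serialized into the request body.
    return {"event": "updated", "entity": entity}

entity = {"type": "asset", "bundle": "seed", "uuid": "a1", "name": "Corn"}
```

Making the level part of the webhook config entity would let the farmOS admin decide, per endpoint, how much is disclosed.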

We also talked about some ideas around adding caching layers to the farmOS Aggregator itself, although I feel like that should be a separate forum topic. :slight_smile:


It would be great to see a few notification plugins developed. I had created a module to send notifications via HTTP requests, but it needs a complete redesign to fit in with the new Notifications module. It was also based on hook_ENTITY_TYPE_update() and worked fine for entities updated through the UI, but not from the API (not usually a major issue, as I’d handle the notification from the external script if needed). I was going to investigate whether an event listener would cover this scenario better, but you probably already know the answer @mstenta?


That’s interesting @Farmer-Ed - I wasn’t aware of that limitation of hooks (although I have a vague memory now of you bringing it up in the farmOS chat). If that’s true, then it could be a blocker for the webhook idea more generally, since right now our asset and log event listeners are getting triggered by those hooks too.

See all four hooks implemented for asset entities here: farmOS/asset.module at 268be488e0c8cf542a5a93ecddef8908ec9a30db · farmOS/farmOS · GitHub

(And notice that they all have an @todo linking to the upstream Drupal core issue.)

I would expect those hooks to be invoked when entities are modified via JSON:API… because I would expect that JSON:API is using the lower level Entity API methods. Maybe we can take a look at this together on today’s dev call.


Our Sci’s downstream Coffee Shop issue for reference: Dedicated API endpoint in FarmOS (#17) · Issues · our-sci / software / coffee-shop · GitLab

@Farmer-Ed FWIW it does look like JSON:API is using the lower level Entity API methods… so I would expect the hooks to work. Let’s take a look together sometime!





Ok, I must give it another look then; it's entirely plausible that there is another issue of my own making. Either way, I’d like to get it working properly.



I think most of what you’re talking about can be achieved via JSON:API by querying all the asset/log types sorted by changed date. Subsequent requests would then filter for records whose changed timestamp is greater than the largest previously observed changed timestamp.
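
Such a request could be constructed like this (a sketch using Drupal's JSON:API `filter[...][condition]` syntax; whether `changed` should be filtered as a unix timestamp or an RFC 3339 string is something to verify against the actual farmOS API):

```python
from urllib.parse import urlencode

def changed_since_query(base, entity_type, bundle, last_changed):
    """Build a Drupal JSON:API URL for records changed after a timestamp.

    Uses JSON:API filter syntax:
    filter[NAME][condition][path|operator|value].
    """
    params = {
        "sort": "changed",
        "filter[recent][condition][path]": "changed",
        "filter[recent][condition][operator]": ">",
        "filter[recent][condition][value]": str(last_changed),
    }
    return f"{base}/api/{entity_type}/{bundle}?{urlencode(params)}"

url = changed_since_query("https://farmos.test", "asset", "seed", 1622505600)
```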

Deletions are the main thing this misses, and they could be (partially) observed by periodically doing a full scan of all the asset/log entities. This can be made more efficient by requesting only the entity ids, e.g. https://farmos.test/api/asset/seed?fields[asset--seed]=id
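
The reconciliation step itself is just a set difference between what the external system has cached and what the id-only scan returns:

```python
def find_deletions(cached_ids, current_ids):
    """Compare a cached id set against a fresh id-only scan.

    current_ids would come from a sparse-fieldset request like
    /api/asset/seed?fields[asset--seed]=id, which returns only ids.
    Anything cached that no longer appears upstream was deleted.
    """
    return set(cached_ids) - set(current_ids)

deleted = find_deletions({"a1", "b2", "c3"}, {"a1", "c3"})
```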


We talked about this more during the dev call today. I want to highlight one thing @Symbioquine mentioned re: webhooks using a simple hook_entity_update approach: this would mean that all write requests would not return until the webhook request is complete. This could cause things like a form submission to become slower in the UI, as well as slower write operations via JSON:API.

For example, editing an asset:

  • User completes an asset edit form, clicks “save”
  • Drupal invokes a hook_asset_update function that sends a synchronous request to a webhook
  • Drupal waits for this webhook request to get a successful response
  • Form submission completes and the user sees the updated asset page

This becomes a larger issue if there are multiple webhooks that need to be notified of the change to this asset. One improvement could be to send asynchronous requests and not check for a successful response… but then it is possible for webhooks to fail and not be attempted again.

A proper solution would be to add webhook requests into a queue that runs separately from the form submission or JSON:API write operation. This queue could be processed at a regular interval (5m, 1h, 1d) or processed in closer to real time via a separate “worker” process. When processing the queue, failed webhook requests could be retried with different strategies.
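
As a language-agnostic sketch of that flow (Drupal's Queue API works differently in practice; this just illustrates the separation): the write path only enqueues, and a separate worker delivers with retries.

```python
# In-memory stand-in for persistent queue storage; a real queue would
# survive restarts (e.g. Drupal's database-backed queue).
queue = []

def enqueue_webhook(url, payload):
    """Called from the entity-update path: cheap, never blocks on HTTP."""
    queue.append({"url": url, "payload": payload, "attempts": 0})

def process_queue(send, max_attempts=3):
    """Run by cron or a worker: deliver each item, retrying failures."""
    remaining = []
    for item in queue:
        item["attempts"] += 1
        if send(item["url"], item["payload"]):
            continue  # delivered successfully; drop from queue
        if item["attempts"] < max_attempts:
            remaining.append(item)  # keep for retry on the next run
    queue[:] = remaining
```

The write operation returns as soon as `enqueue_webhook` finishes, regardless of how slow or numerous the webhook endpoints are.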

The issue with a queue is that it would require additional configuration of the farmOS environment to process the webhook queue. Drupal core has an API for queue operations. The simplest way to process these queues is via cron, where a module can implement hook_cron and claim and process items from a given queue. But this requires that Drupal cron is run at the desired interval at which webhooks would be processed.

That said, it does appear that these Drupal queue operations support other/custom backends, perhaps including more traditional task queue frameworks. However, these backends would likely require adding additional services/workers/processes to the farmOS server environment. I can’t find proper documentation on this, but there are various blog posts; this one is more recent and helped me understand what I described above: Queues in Drupal 8 and 9 | Blog | Alan Saunders - Carlisle Drupal / PHP Developer

We identified that it might make the most sense to have another service (like the aggregator) poll the farmOS server for updates, and have this service dispatch update information via webhooks to external services.

@Symbioquine you had a lot of good points/ideas regarding queues eg: “it’s important for (resilient) queues to have their own storage mechanism” - feel free to add if you think I missed anything :slight_smile:


Yeah, I probably said this in a weird way. Queues are a storage mechanism, but the point I was trying to make was more about the API a queue has to have in order to guarantee successful at-least-once processing. Normally, this takes the form of the queue consumer having a “take”, “act”, and “ack” flow, in which queue items are not fully removed from the queue until they have been ack’ed (following successful processing).
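
A minimal sketch of that take/act/ack flow (the class and method names are mine, for illustration; Drupal's own queue interface uses claim/delete/release terminology):

```python
# Items are leased to a consumer but stay in storage until explicitly
# acknowledged, guaranteeing at-least-once processing even if the
# consumer crashes mid-item.
class AckQueue:
    def __init__(self):
        self._items = {}     # item id -> item
        self._leased = set() # ids currently taken by a consumer
        self._next_id = 0

    def put(self, item):
        self._items[self._next_id] = item
        self._next_id += 1

    def take(self):
        """Lease the next unleased item without removing it."""
        for item_id in self._items:
            if item_id not in self._leased:
                self._leased.add(item_id)
                return item_id, self._items[item_id]
        return None

    def ack(self, item_id):
        """Remove an item only after successful processing."""
        self._leased.discard(item_id)
        del self._items[item_id]

    def release(self, item_id):
        """Consumer failed: return the lease so the item can be retaken."""
        self._leased.discard(item_id)
```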

I also mentioned that at-least-once processing is not necessary if there is some sort of reconciliation mechanism for lost events. In other words, the webhook mechanism we’re discussing here could potentially be a “best effort” mechanism and there could still be a periodic polling mechanism to catch any lost changes. If that were the case, the webhook calls could be pretty aggressive about timing out and not being retried. Parameters for the timeout and polling frequency would control the likelihood of the external system being out of date and for at most how long.
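
A back-of-the-envelope bound for that "at most how long" question, under the assumption that a lost webhook is never retried and only the next reconciliation poll catches it:

```python
def max_staleness(webhook_timeout, polling_interval):
    """Rough upper bound on how long the external system stays stale.

    If a best-effort webhook is lost (it gives up after
    webhook_timeout and is never retried), the change is only picked
    up by the next reconciliation poll, so the worst case is roughly
    the timeout plus one full polling interval.
    """
    return webhook_timeout + polling_interval

# e.g. a 5s webhook timeout with a 5-minute poll bounds staleness
# at roughly 305 seconds in this simplified model.
bound = max_staleness(5, 300)
```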


Will Gardiner shared this blog post with me, relevant to this discussion: Give me /events, not webhooks


That’s a very nice write-up about why polling is (almost always) better!


Maybe we should implement ActivityPub for the “audit log” :wink:

(I am not necessarily recommending this - just reading/learning and the overlap with this discussion came to mind.)


I just skimmed the protocol docs, but it seems like an extremely stateful protocol. (As it probably needs to be for its target problem domain.)

I guess the arguments for it would need to be along the lines of interoperability (with existing services/implementations) and anti-bike-shedding, since it appears to involve all the cons we discussed already in this thread in terms of per-subscriber queues and callbacks to push events to subscribers, as opposed to having subscribers poll for events from a less stateful (with regards to event propagation) server.
