Workshopping the farmOS Data Model, Conventions, and Schema Formats

The following use case builds on a conversation from yesterday with @paul121 and @OctavioDuarte, which was partly prompted by the CUE data validation and schema definition language and partly by a discussion around conventions in general and Our Sci’s mirror_farmos. We discussed convening some workshops for anyone else with a use case like the one below, preferably with their requirements defined ahead of time and a couple of concrete examples we could test out. We can work on documenting them and thinking through possible solutions, maybe try implementing some things. @OctavioDuarte also suggested we start with some JSON Schema and experiment with converting it to other notations, like CUE or SHACL, using some of the automation tools that are already available. I think that’s a great idea. We could also try modeling some entities by hand, either in accordance with the farmOS Data Model or as independent models that would maintain compatibility with it.

Testing Location Logic by Implementing the farmOS Data Model in Node.js

In farmOS.js, I replicated a fair amount of the farmOS Data Model, or at least, the parts that can be reconstructed from JSON Schema alone. With farmOS.js as its primary syncing engine, Field Kit is able to sync its own records with any farmOS server, regardless of whatever additional modules the server may have installed and however those may impact the way entities are structured. Because farmOS.js can also be used in Node.js, not just the browser as with Field Kit, it could theoretically be used with its own database connection and an Express web server as a naive implementation of farmOS. Basically you could run a “headless” clone of farmOS, which is similar to what @OctavioDuarte has done with mirror_farmos.

However, there would still be some unresolved issues with data integrity if someone was relying exclusively on such an implementation, without a regular farmOS server somewhere in the loop to reconcile certain entity fields from time to time. The best example I know of to illustrate the kind of reconciliation that’s required but not represented by the JSON Schema is the location logic. Without that logic to update an asset’s present location and geometry in a consistent manner, you could end up with the same asset in two non-overlapping locations, or other issues. The inventory logic and group membership logic would pose similar issues, as would probably other undocumented procedures for maintaining data integrity.

So my main ask, which I think is representative of this general problem, is this: How can farmOS location logic be represented in serialized form, so that it could be replicated on another device, shared over the wire, and versioned in the event of future changes to the location logic’s main algorithm?

I see two general approaches for addressing this:

  1. Serialize the location logic itself, in the form of some kind of RPC or query syntax or even raw code.
  2. Make the concept of locations and/or movements into an independent data structure (call that an entity, an object, or what you will), so it can be referenced separately from both the asset it refers to and the log that records when and how the movement took place (see the sketch below).
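
To give a rough sense of what I mean by the second option, here’s a hypothetical sketch in plain JavaScript objects. None of these field names come from farmOS itself; the point is just to show a movement as an atomic record, referenced by the log rather than embedded in it as `is_movement` plus `location`:

```js
// Hypothetical data structures for approach #2 (not an existing farmOS schema).
const movement = {
  id: 'movement-1',                                   // id of the movement event itself
  type: 'movement',
  asset: { id: 'asset-1', type: 'asset--animal' },    // the asset being moved
  location: [{ id: 'field-a', type: 'asset--land' }], // where it moved to
  geometry: 'POINT (-73.95 42.11)',                    // optional geometry override (WKT)
  timestamp: 1714406400,
  status: 'done',
};

const log = {
  id: 'log-1',
  type: 'log--activity',
  timestamp: 1714406400,
  status: 'done',
  asset: [{ id: 'asset-1', type: 'asset--animal' }],
  quantity: [],                                        // quantities stay on the log, unchanged
  movement: { id: 'movement-1', type: 'movement' },    // reference in place of is_movement, location, geometry
};
```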

I probably favor the second option at this point, but obviously that would be a breaking change, so it can’t really be expected that farmOS itself would adopt that model outside of a major version upgrade. That is fine for my purposes. I’m confident a Node.js server and database built this way would still be able to connect with a regular farmOS server to exchange data, since farmOS.js already does this pretty well, albeit with far less robust conflict resolution.

There are still other aspects of the logic that this may not address, but for the most part, farmOS comes pretty close to an event sourcing architecture, so I’m mainly looking for places where that could be fully realized. By separating certain fields like geometry and is_movement from standard logs into more atomic events that could reference the asset and then, in turn, be referenced by a log, I think we could greatly simplify the logic required to materialize the state of a given asset’s location from those records, with easily reproducible logic that could be applied universally. Perhaps in some instances, approach #1 could be employed, but with simpler, more generic logic that was less stateful, and perhaps in combination with #2, something akin to AT Protocol’s Lexicon.
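
To make the event-sourcing angle concrete, here’s a minimal sketch of what materializing an asset’s current location could look like under that alternative model, folding over the hypothetical movement records from the earlier sketch. This is not the actual farmOS logic, just an illustration of how small and reproducible the computation could be:

```js
// Fold a list of hypothetical movement events into the asset's current location.
// Assumes each event carries its own timestamp and status, with ties broken by a
// stable id, so every replica materializes the same view from the same records.
function currentLocation(asset, movements, now = Math.floor(Date.now() / 1000)) {
  const latest = movements
    .filter(m => m.asset.id === asset.id && m.status === 'done' && m.timestamp <= now)
    .reduce((acc, m) => {
      if (!acc) return m;
      if (m.timestamp !== acc.timestamp) return m.timestamp > acc.timestamp ? m : acc;
      // Deterministic tie-breaker so the result doesn't depend on fetch order.
      return m.id > acc.id ? m : acc;
    }, null);
  return latest ? latest.location : [];
}
```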

Or maybe none of this is really necessary. Maybe it’s enough if the relevant fields can be adequately specified, isolated, and protected against overriding configurations. And if it can be guaranteed that the core logic won’t be subject to arbitrary changes, like with the addition of a farmOS/Drupal module, I’d be pretty happy about that. I’m looking now at ActivityPub’s spec for server-to-server interactions, and it occurs to me that the current farmOS logic specifications aren’t all that different. I’m not committed to any one approach, but I think the best way to find out is to try implementing this, with special attention given to the location logic, to see how it handles potential conflicts.

Documented decisions regarding locations

Some relevant links on the decision to make locations a type of asset entity in farmOS 2.x (they were previously represented as taxonomy terms in 1.x):

4 Likes

It seems like a good distinction to make here would be to ask the question: which attributes/relationships in the farmOS data model are “computed”?

The location and geometry fields on asset entities are an example of computed fields. They cannot be directly edited or updated like other attributes or relationships. Their values are derived from other records. In this case, they represent the “current” location/geometry of an asset.

Other examples on asset entities are inventory (which summarizes the asset’s current inventory levels) and group (which shows the group that the asset is currently a member of).

Logs don’t have any computed fields, at least that I can think of off the top of my head… :thinking:

A key point to make here, which needs to be understood before even digging into “serialization of the logic”, is: this logic requires that ALL data is stored on the other device. Inventory cannot be computed if you don’t have ALL the inventory adjustment logs, for instance. Location and group membership only require the most recent movement log, but there needs to be a guarantee that those are available, otherwise you could end up with an incorrect computed value, even if you had the same logic.

But, as long as you have a complete dataset, then in theory replicating the logic shouldn’t be difficult.
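
For reference, here is roughly how I understand the documented inventory logic, sketched in plain JavaScript. Field names follow the farmOS Data Model as far as I know them, but treat this as an approximation rather than the canonical implementation:

```js
// Approximate sketch of the documented inventory logic: walk all "done" logs in
// chronological order and apply each quantity that adjusts this asset's inventory.
// A 'reset' adjustment replaces the running total, which is why every adjustment
// since the last reset (per measure + units pair) must be present for a correct result.
function computeInventory(assetId, logs, now = Math.floor(Date.now() / 1000)) {
  const totals = {}; // keyed by `${measure}|${units}`
  logs
    .filter(log => log.status === 'done' && log.timestamp <= now)
    .sort((a, b) => a.timestamp - b.timestamp)
    .forEach(log => (log.quantity || []).forEach(q => {
      if (!q.inventory_adjustment) return;
      if (!q.inventory_asset || q.inventory_asset.id !== assetId) return;
      const key = `${q.measure || ''}|${q.units ? q.units.id : ''}`;
      const current = totals[key] || 0;
      if (q.inventory_adjustment === 'reset') totals[key] = q.value;
      else if (q.inventory_adjustment === 'increment') totals[key] = current + q.value;
      else if (q.inventory_adjustment === 'decrement') totals[key] = current - q.value;
    }));
  return totals;
}
```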

This is precisely why we document the logic for all of these things in farmos.org/model/logic/… so that other applications can replicate it if they need to. It’s not perfect (eg: the group module modifies location logic, which is not currently documented), but it’s a minimum. :slight_smile:

I don’t fully understand this. The log has other necessary information like timestamp and status and the reference to the asset… all of which are key pieces of the logic. So I would say the log already is the “separate entity” you describe. But maybe I’m not following the idea…

Very curious to understand this! It seems that we already have an atomic model with logs as the events that get aggregated to compute “current state” (eg: current location, current inventory, current group, etc).

I haven’t really seen a description of what the “problem” is that we’re trying to solve. I think I can guess at it, but it might help to spell that out more explicitly in this thread with some examples.

3 Likes

I’m also curious to hear what data structure you’re imagining for approach #2 @jgaehring! It’s not (probably naively) obvious to me that an independent data structure changes the problem at all. Maybe some class diagrams of what you’re imagining like that Fowler Event Sourcing article would help show what you’re thinking of structure-wise?

Nit pick: “Inventory [for a given asset] cannot be computed if you don’t have ALL the inventory adjustment logs [pertaining to that asset since the last reset adjustment for each unit/measure pair].”

In practical terms this means that an application like Field Kit or Asset Link can compute reasonably accurate inventory changes offline if it keeps track of whether it has previously queried all such logs for a given asset. (Unfortunately, I don’t think the reset part yields any useful optimizations since there’s no way to query for that against unique unit/measure pairs.)

Obviously, it cannot know in an offline scenario whether new logs (with relevant inventory adjustments) have been created by other actors, but displaying an inventory result offline that does not include those is probably the expected/intuitive behavior.

It is also very cheap to later check whether any new logs relevant to a given inventory computation exist, by querying those logs again using the changed timestamps with a page size of 1.
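
For anyone curious what that check might look like in practice, here’s a rough sketch against the farmOS 2.x JSON:API. The filter parameter names follow Drupal JSON:API conventions as I recall them, so verify the exact syntax against an actual server before relying on this:

```js
// Rough sketch: ask the server whether any log referencing the asset has changed
// since our last known "changed" value, fetching at most one result.
async function hasNewerLogs(host, token, assetId, lastChanged) {
  const params = new URLSearchParams({
    'filter[asset_filter][condition][path]': 'asset.id',
    'filter[asset_filter][condition][value]': assetId,
    'filter[changed_filter][condition][path]': 'changed',
    'filter[changed_filter][condition][operator]': '>',
    'filter[changed_filter][condition][value]': lastChanged,
    'page[limit]': '1',
  });
  const res = await fetch(`${host}/api/log/activity?${params}`, {
    headers: { Authorization: `Bearer ${token}`, Accept: 'application/vnd.api+json' },
  });
  const { data } = await res.json();
  return data.length > 0;
}
```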

(Sorry if that is a bit of a distraction, but I think it helps illustrate the kinds of things applications like these need to be aware of to operate efficiently offline.)

2 Likes

Blah, sorry I didn’t get a chance to respond sooner! I’ll try to respond where I can and continue following up on the rest tomorrow…

This, right here. :point_up: Yea, I think that is sooooo important and would likely need to be represented in any formal specification that addresses my core question about serializing the logic.

I think this should be a rule, actually — ie, asset fields can be computed with fields from logs as parameters, not vice versa. Otherwise you could endlessly recurse or some other weirdness. Or you’d have to add specifications about what fields on what entities could be used to compute what other fields on other entities and it just gets to be a mess. It also permits more stringent adherence to the whole event sourcing model (more below).

This is where I think event sourcing could really help out, maybe borrowing a little from CRDTs too. If you can represent the state of any given computed field on an asset at a particular point in time, where a fixed collection of logs have been used to essentially generate a materialized view of that field, then you can cache that state and shouldn’t have to retain all records prior to that point. You can hash the relevant values of the logs (and any other values) used to compute it to make sure that’s the latest save-point, or just compute values on top of it for a certain duration. That depends on a few assumptions, but I think they’re assumptions we can get away with. The first being that, if the logic by which you compute those views is commutative and associative, you’ve got some assurance that any updates computed on top of the materialized view will be equivalent to updates computed on the entire database. I’m re-familiarizing myself with the literature around that, reading this currently to get reacquainted with state-based CRDTs (aka, convergent or CvRDTs), which I think are most suitable to this case.

Then there’s the issue of revision history, but I think that can be reconciled just by including the revision ids/timestamps in the data used to compute the hash. So just using the Last-Write-Wins (LWW) approach, which can also work as a CRDT. Maybe add some other tricks to make that more efficient, like backtracking a given amount of time from the last save-point to ensure that if an update is posted retroactively, but not too deep in the past, it can still be recomputed without needing to re-fetch old records from somewhere else.
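
A very rough sketch of the kind of save-point I have in mind, just to make the hashing idea concrete (all names here are hypothetical, not anything farmOS provides):

```js
const { createHash } = require('node:crypto');

// Hypothetical "save-point" for a computed field: hash the inputs that went into
// the computation (including revision timestamps), so we can tell later whether a
// cached materialized view is still current without re-fetching every prior log.
function savePoint(assetId, field, value, logs) {
  const inputs = logs
    .map(log => ({ id: log.id, changed: log.changed, timestamp: log.timestamp }))
    .sort((a, b) => (a.id < b.id ? -1 : 1)); // stable order so the hash is reproducible
  const hash = createHash('sha256')
    .update(JSON.stringify({ assetId, field, inputs }))
    .digest('hex');
  return { assetId, field, value, computedAt: Date.now(), hash };
}
```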

1 Like

Agreed on all points!

If all you want is the current state, this is true. If you want to see what the state was in the past, or will be in the future, then you need all the logs. It all depends on what the application using the data wants to do. :person_shrugging:

1 Like

Coming back…

When I look at the JSON Schema for a log type, it seems like too much is crammed into one entity to represent an atomic event. I’ve also had occasion to think a particular log really ought to be multiple logs. Maybe that’s the result of logs being used incorrectly, or maybe it’s not fair to judge that as too much for a single event. Either way, I wouldn’t advocate throwing away any of those fields, just that the log be composed of smaller units that only concern the movement logic, or the inventory logic, etc. And again, not that farmOS itself needs to do it that way, just that an alternative data model could be specified that would still be compatible with farmOS.

Thinking about it now, it may be less about serializing the logic and sending it over the wire, and more about encapsulating the data that a given logic operation depends upon. Like, if I’m moving some material from one storage location to another, and recording that in an activity log, the movement logic itself doesn’t depend on the log’s quantities field. The quantities might be significant or relevant to both the activity log and the material asset it references, but a change to the quantity shouldn’t change the outcome of computing the asset’s location, not so long as the log’s geometry and location fields remain the same. Nor should the log’s geometry or location affect the logic for how the asset’s inventory is calculated. That’s probably not a problem most of the time, but if a log’s timestamp is modified after the fact, or its status changes with regard to one relationship but not another, or some notes get added to the log that don’t actually affect anything but tick the revision history’s timestamp, that all seems to come at a performance cost at best, and may lead to errors or incorrect computed values at worst. These are perhaps edge cases, but still seem important. Sure, you can put more guards around that type of behavior, but then are you accurately able to specify that in the model’s logic? Maybe so, but I’m legitimately curious to know how.
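
One way to express that dependency isolation, sketched very loosely: project a log down to only the fields a given logic operation actually reads, so unrelated edits don’t invalidate the computed result. The field names below follow the farmOS log schema as I understand it, but the projection itself is hypothetical:

```js
// Hypothetical projection of a log down to only the fields the location logic
// depends on. If this projection is unchanged between revisions, the asset's
// computed location doesn't need recomputing, regardless of what happened to
// quantities, notes, or other unrelated fields.
function movementProjection(log) {
  if (!log.is_movement) return null;
  return {
    id: log.id,
    timestamp: log.timestamp,
    status: log.status,
    asset: (log.asset || []).map(a => a.id).sort(),
    location: (log.location || []).map(l => l.id).sort(),
    geometry: log.geometry || null,
  };
}
```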

Perhaps I should drop the issue of serialization and restate the problem like this: If you take a complete copy of a farmOS instance’s data, and you independently calculate all the locations of every asset based on every movement log, all the quantities based on all the inventory adjustments, all according to the farmOS data model’s logic specifications, can you be guaranteed to get the same results as the original farmOS instance?

I feel pretty confident you would get the same results, at least most of the time. But how much of the time? And what amount of variance is tolerable? There’s probably no definite answer to these questions, but it comes back to what I was originally proposing with this topic: I’d like to make a thorough-going attempt to replicate those results using different data/schema/IDL formats, then generate some data and run comparisons.
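
Something like the following comparison harness is what I have in mind for the workshop, at least as a starting point. This is only a sketch; `fetchAllAssets`, `fetchAllLogs`, and `computeLocation` are placeholders for whatever client and logic implementation we would actually test:

```js
// Sketch of the experiment: independently recompute every asset's location from
// the full set of logs, then diff against the values reported by the original
// farmOS instance. An empty result means the two implementations agree.
async function compareLocations(fetchAllAssets, fetchAllLogs, computeLocation) {
  const assets = await fetchAllAssets();
  const logs = await fetchAllLogs();
  const mismatches = [];
  for (const asset of assets) {
    const expected = (asset.location || []).map(l => l.id).sort().join(',');
    const actual = computeLocation(asset, logs).map(l => l.id).sort().join(',');
    if (expected !== actual) mismatches.push({ asset: asset.id, expected, actual });
  }
  return mismatches;
}
```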

Maybe nothing surprising would come out of such a workshop, but if done as a group, I’d still be interested to see what we came up with.

2 Likes

It should be 100%. If not, you either have a bug in the calculation logic or a bug in the mechanism used to make the copy.

In some scenarios you could get incorrect intermediate results while the data is being copied (or maybe shortly thereafter), but that’s the beautiful thing about the calculations - the calculated values are not stored anywhere, so if the copying process is eventually consistent, so are the resulting calculations.

An alternate implementation would almost certainly need to look at the code to arrive at the same logic though. For example, the detail that sorting falls back on the id (a.k.a. drupal_internal__id) for the purposes of inventory/movement/grouping log sorting. And of course that would need to inform the copying strategy too - i.e. if any logs have the exact same timestamps, it is necessary to preserve the drupal_internal__id when copying the data, not just the UUIDs.
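
In code, that detail is a one-line tie-breaker, but it’s easy to miss if you’re only working from the written spec. Something like this sketch, assuming the copied logs still expose drupal_internal__id:

```js
// Sort logs by timestamp, falling back on the Drupal internal id when timestamps
// are identical. A copy that dropped drupal_internal__id couldn't break these
// ties the same way the original instance does.
function compareLogs(a, b) {
  if (a.timestamp !== b.timestamp) return a.timestamp - b.timestamp;
  return a.drupal_internal__id - b.drupal_internal__id;
}
```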

1 Like

It should be, but is it? This is why I want to try.

Assume it’s a complete copy of the database. And I don’t think it can be assumed the bug is in the alternative implementation and not in the original or in the spec.

To me this implies the specification is incomplete. Or the farmOS implementation itself does not comply with the spec.

Devil’s in the details. How could irregularities like that be incorporated into the spec, and/or made more transparent and predictable in the farmOS implementation?

I’m really not so interested in speculating on what should or shouldn’t be the results of one implementation or another. I want to see for myself. So if others are interested in actually setting up a workshop to try this stuff out, please let me know!

Yeah, that was more or less the whole point of my last paragraph.

I agree those things should be in the documentation. I think we’re just waiting on somebody with the free time and motivation to make that happen.

Sounds like a good idea. If you have an idea where there’s a bug in farmOS - and have the time - it would be valuable to nail it down and report it.

Where you’re losing me is why that process should be a synchronous or real-time group activity.

I’m happy to provide support asynchronously (here maybe) if somebody gets stuck - or reviewing those doc changes - but the actual bug finding presumably involves a lot of pulling images, destroying/recreating environments, copying data, etc. i.e. waiting around. Probably not the best use of high-cost group activity time.

Sorry to be a downer, I’m probably just missing the point of the activity you’re proposing. Maybe I’d be more excited about it in practice…