Unintended Backend overload detection

ikegam · March 19, 2025, 7:28am

The backoff process during backend overload is implemented using exceptions when writing to InfluxDB. However, even when InfluxDB returns a type exception, the query limit counter still increases. As you know, once InfluxDB creates a value in a field, its type is fixed. Therefore, in a long-term OpenEMS operation with various versions of Edge, data types can become inconsistent.

On the backend side, there is code that casts types when a type exception occurs. The backend also has a filter that modifies the data format to match the type required by InfluxDB. In this kind of environment, when the backend is restarted, if the system is unlucky, the process of registering type casting to the filter can repeatedly increase the query limit counter, eventually leading to a service outage.

To solve this issue, I have created the following PR, but I am not very confident about it. How do most people in the community approach this problem?

github.com/OpenEMS/openems

Improve querylimit handling to avoid unintended increment due to type conflicts

develop ← girasolenergy:fix-querylimit

opened 06:40AM - 17 Mar 25 UTC

ikegam

+25 -19

Previously, the backend incremented querylimit even for type conflicts in InfluxDB queries. Since querylimit should only reflect actual overload cases, this change modifies the handling logic to prevent increments caused solely by type conflicts. I have tested this PR, and it appears to be working as expected. However, since this affects a core part of the backend functionality, I'm opening it initially as a Draft PR for further review and discussion.

In our case, when we recently changed MeasuringEVCS to ElectricityMeter for EVCS, the OCPP /Frequency value changed from float to int. As a result, we had to rely heavily on this backend functionality to handle data from multiple Edge devices.

ikegam · March 19, 2025, 10:40am

Thank you very much for keeping this forum clean, @sn0w3y!

Sn0w3y · March 19, 2025, 7:42pm

Hello,

No problem, although I have to say I would appreciate it even more if you joined the Association, given that you are using OpenEMS commercially.

Greetings!

ikegam · March 19, 2025, 10:46pm

Our company has been in the process of joining the Association. I will now check the current status. Thanks again.

stefan.feilmeier · March 21, 2025, 9:06am

@Sn0w3y Girasol Energy has been a member since 2023. For some reason they are still not listed on OpenEMS Association e.V. – OpenEMS.

It would also be interesting to have a way to show which Community members belong to the association. Maybe a badge could be granted, etc. Need to do some research…

Back to topic: I asked @michaelgrill to review the PR as he has been working on that code recently.

Thanks!

c.lehne · March 23, 2025, 9:41am

Hi ikegam,
I am not sure that it helps to increase the query limit counter to solve the problem (note that I did not have a close look at your code).
We once had what was probably the same issue. When we restarted the Backend, we lost thousands of datapoints within the first hour after the restart. The reason was hard to detect, because our logfile looked fine except for a few Typecast-Exceptions.
In our case the problem was that every time a Typecast-Exception occurred, all measuring points transmitted in the same write were lost as well. It turns out that the OpenEMS backend collects more than 1000 measuring points and transmits them to the database in one batch. So every time we got a Typecast-Exception, we lost all of these 1000+ measuring points. On a typical backend startup we saw maybe 50 Typecast-Exceptions in the backend log. In the end, this led to more than 50,000 lost measuring points.

We solved this by fixing our code and updating all Edges. We still have a very few “unsolvable” Typecast-Exceptions, but for those we added some hardcoded, predefined typecast handlers.
Now, any newly seen Typecast-Exception sets the alarm bells ringing, and we respond with immediate action. We are also trying to improve our internal development process so that this kind of problem is already caught in the review phase.

So I would suggest fixing all Typecast-Exceptions as early as possible. In the end this may help you more than fixing the querylimit counter mechanism.

ikegam · March 24, 2025, 12:43am

Thanks for getting back to me, @c.lehne!

You’re absolutely right to keep types consistent on your Edges. That’s definitely the right approach!

I also want to resolve the type issues from the edge side, but unfortunately, the data in our InfluxDB is already inconsistent. For example, evcs[1...4]/Frequency is stored as float, while evcs[5...10]/Frequency is int. And newer Edges send this as int.

One option would be to rebuild the InfluxDB entirely, but that would be quite a heavy task. Alternatively, we could shift to a new field, but that would increase the number of columns.

I’ll think about it more.
Either way, thanks a lot for sharing your experience. It was super helpful!

ikegam · March 24, 2025, 5:12am

Mar 14 18:24:01 XXX java[3588521]: 2025-03-14T18:24:01,845 [thread-1] INFO [.debugcycle.DebugCycleExecutor] [Timedata.InfluxDB] timedata0 [monitor] Pool: 8/10, Pending: 0, Completed: 8, Active: 0, MergePointsWorker[Default: 0/1000000], Limit:0.000, RejectedExecutions:0
Mar 14 18:24:02 XXX java[3588521]: 2025-03-14T18:24:02,073 [fluxDB-8] WARN [red.influxdb.MergePointsWorker] Unable to write to InfluxDB. BadRequestException: HTTP status code: 400; Message: partial write: field type conflict: input field "evcs3/Frequency" on measurement "data" is type integer, already exists as type float dropped=25
Mar 14 18:24:06 XXX java[3588521]: 2025-03-14T18:24:06,845 [thread-1] INFO [.debugcycle.DebugCycleExecutor] [Timedata.InfluxDB] timedata0 [monitor] Pool: 9/10, Pending: 0, Completed: 9, Active: 0, MergePointsWorker[Default: 0/1000000], Limit:0.100, RejectedExecutions:0
…
Mar 14 18:28:16 XXX java[3588521]: 2025-03-14T18:28:16,844 [thread-1] INFO [.debugcycle.DebugCycleExecutor] [Timedata.InfluxDB] timedata0 [monitor] Pool: 10/10, Pending: 0, Completed: 34, Active: 0, MergePointsWorker[Default: 0/1000000], Limit:0.950, RejectedExecutions:0

This is the log when it happens. With the patch above, it changes as follows.

Mar 16 16:44:42 XXX java[1367630]: 2025-03-16T16:44:42,478 [fluxDB-0] WARN [red.influxdb.MergePointsWorker] Unable to write to InfluxDB. BadRequestException: HTTP status code: 400; Message: partial write: field type conflict: input field "evcs25/Frequency" on measurement "data" is type integer, already exists as type float dropped=1
Mar 16 16:44:42 XXX java[1367630]: 2025-03-16T16:44:42,479 [fluxDB-0] INFO [nflux.FieldTypeConflictHandler] [Timedata.InfluxDB] Add handler for [evcs25/Frequency] from [integer] to [float]
Mar 16 16:44:42 XXX java[1367630]: Add predefined FieldTypeConflictHandler: this.createAndAddHandler("evcs25/Frequency", RequiredType.FLOAT);
Mar 16 16:44:42 XXX java[1367630]: 2025-03-16T16:44:42,749 [thread-1] INFO [.debugcycle.DebugCycleExecutor] [Timedata.InfluxDB] timedata0 [monitor] Pool: 10/10, Pending: 0, Completed: 311, Active: 0, MergePointsWorker[Default: 2/1000000], Limit:0.000, RejectedExecutions:0
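
For readers who want to see how the field name and the two types can be pulled out of such a message, here is a minimal sketch. It is not the backend's FieldTypeConflictHandler; the class name, record, and regular expression are made up for this example, and it assumes the message uses plain double quotes and the wording shown in the log above. It needs Java 16 or newer because of the record.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of OpenEMS: parses a field type conflict message. */
final class FieldTypeConflictSketch {

    /** What the message tells us: which field, and which type InfluxDB got vs. expects. */
    record Conflict(String field, String measurement, String actualType, String requiredType) {}

    // Assumed message layout, copied from the WARN line in the log above.
    private static final Pattern CONFLICT = Pattern.compile(
            "field type conflict: input field \"(?<field>[^\"]+)\""
            + " on measurement \"(?<measurement>[^\"]+)\""
            + " is type (?<actual>\\w+), already exists as type (?<required>\\w+)");

    /** Returns the parsed conflict, or empty if the message is some other error. */
    static Optional<Conflict> parse(String message) {
        if (message == null) {
            return Optional.empty();
        }
        Matcher m = CONFLICT.matcher(message);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new Conflict(
                m.group("field"), m.group("measurement"),
                m.group("actual"), m.group("required")));
    }
}
```

With the message from the log above, `parse(...)` yields the field `evcs25/Frequency`, the sent type `integer`, and the stored type `float`, which is the information needed to register a cast for that field.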

c.lehne · March 24, 2025, 4:43pm

I know, tidying up is a thankless task. But if you don’t do it, you will regularly lose up to 1000 measuring points every time a new evcs[5...10]/Frequency value is received.

We had the same issue with evcsX/Frequency. Our solution in that case: kick all evcsX/Frequency out. We do not need it. If I want to work with the frequency, I use the grid meter frequency. I can’t imagine a situation where I would need the frequency of an EVCS. We also had the same issue with FIRMWARE (I think), which was sometimes an Integer, a Float or a String, depending on the EVCS. We cleaned it up, renamed it to FIRMWARE_VERSION (I think) and made it always a String. You are right, this increases the number of columns in the DB. But I found that Influx handles this really well. I don’t think that a few more columns will become an issue in the future.

So far, we have managed to deal with this problem well. But I’m a bit scared of the day when we can’t get any further.

ikegam · March 25, 2025, 2:39am

The way we solved it is quite similar. Maybe this patch will also work well in your backend.
I guess this exception is expected to occur with InfluxDB and OpenEMS, given that the backend already has a function to handle typecast errors.

Currently, the overload detection works by increasing the querylimit counter whenever an exception occurs during a write. My patch simply changes this behavior so that the counter is not increased when the exception is caused by a type error, which is not overload.
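
To make the idea concrete, here is a minimal sketch of that behavior. It is not the code from the PR: the class, the method names, and the step size of 0.05 are assumptions for illustration only, and the type conflict check simply looks for the text that appears in the InfluxDB error message.

```java
/** Hypothetical stand-in for the backend's query limit; not the OpenEMS class. */
final class QueryLimitSketch {

    private static final double STEP = 0.05; // assumed step size, for illustration
    private double limit = 0.0;

    /** A write failed for an overload-like reason: raise the limit, capped at 1. */
    void increase() {
        this.limit = Math.min(1.0, this.limit + STEP);
    }

    /** A write succeeded: relax the limit, floored at 0. */
    void decrease() {
        this.limit = Math.max(0.0, this.limit - STEP);
    }

    double get() {
        return this.limit;
    }

    /** True if the error text describes a field type conflict rather than overload. */
    static boolean isFieldTypeConflict(String message) {
        return message != null && message.contains("field type conflict");
    }

    /** Called when a write fails: count it only if it is not a type conflict. */
    void onWriteFailure(String message) {
        if (isFieldTypeConflict(message)) {
            return; // a data problem, not overload: leave the limit untouched
        }
        this.increase();
    }
}
```

In this sketch, a failure whose message contains "field type conflict" leaves the limit at its current value, while any other write failure still pushes it up as before.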

Yes, I also feel that /Frequency is not needed for EVCS. That’s also one of the possible ways I’m considering to solve this.

Actually, it’s when we get data for the old evcs[0...4] that the problem happens. The new Edges send /Frequency as an int, but InfluxDB expects a float.

stefan.feilmeier · March 28, 2025, 8:26am

A sidenote, as it does not exactly solve your issue. We primarily rely on the “Timedata.AggregatedInfluxDB” for everything that is displayed to the user. The implementation allows us to predefine the types in the “AllowedChannels” class: openems/io.openems.backend.timedata.aggregatedinflux/src/io/openems/backend/timedata/aggregatedinflux/AllowedChannels.java at develop · OpenEMS/openems · GitHub

Whenever “AggregatedInfluxDB” cannot serve a request, “Core.TimedataManager” forwards the request to the other InfluxDB service.
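
The idea of predefining types can be illustrated with a small sketch. This is not the actual AllowedChannels code; the class, the enum, and the two example channel entries are hypothetical and only show the principle: every allowed channel has one fixed type, incoming values are converted to it before writing, and unknown channels are not written at all.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration of predefined channel types; not the OpenEMS AllowedChannels class. */
final class PredefinedTypesSketch {

    enum FieldType { LONG, DOUBLE }

    // Example entries only, chosen for illustration.
    private static final Map<String, FieldType> TYPES = Map.of(
            "_sum/EssSoc", FieldType.LONG,
            "evcs0/Frequency", FieldType.DOUBLE);

    /**
     * Converts a value to the predefined type of its channel.
     * Returns empty if the channel is not in the predefined map.
     */
    static Optional<Number> coerce(String channel, Number value) {
        FieldType type = TYPES.get(channel);
        if (type == null || value == null) {
            return Optional.empty();
        }
        switch (type) {
        case LONG:
            // Round to the nearest whole number
            return Optional.<Number>of(Long.valueOf(Math.round(value.doubleValue())));
        default:
            return Optional.<Number>of(Double.valueOf(value.doubleValue()));
        }
    }
}
```

Because the conversion happens before the write, an Edge that sends an integer for a channel predefined as DOUBLE would not trigger a field type conflict in this scheme.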

ikegam · April 2, 2025, 7:38am

This approach is a much better way to handle a single controller across all channels, as it greatly simplifies maintaining type integrity.

Perhaps we should consider deploying this aggregated InfluxDB and migrating to the new server.

However, since many users, as shown in this thread, still depend on raw InfluxDB, it might be beneficial to implement a backend solution to address this issue.