Opened 6 years ago
Last modified 4 years ago
#17388 new enhancement
[WIP PATCH] gpx_distance() is slow when there are many gpx points
Reported by: | taylor.smock | Owned by: | team |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Core | Version: | |
Keywords: | mapcss | Cc: |
Description
I added a bunch of gpx tracks to josm, and then ran a validator.
The total distance covered was in excess of 400 km (so probably about 13,000 gpx points and around 180,000 primitives for around 2,340,000,000 operations per gpx_distance()
call) and it was a conglomeration of several different tracks.
I'm thinking that a possible solution would be to make bounding boxes for the gpx tracks and iterate through them to find the closest bounding box (probably find the bounding boxes that overlap with the closest bounding box) and iterate through the bounding boxes.
The reason it is slow is due to an n2 algorithm (for each osm primitive, we have to loop through the gpx tracks at least once).
Workaround: Reduce the number of osmprimitives/gpx points that are loaded at any one time.
I probably should have tested something like this, but I didn't think to do so.
Attachments (8)
Change History (30)
by , 6 years ago
Attachment: | GpxDistance_speed.patch added |
---|
by , 6 years ago
Attachment: | GpxDistance_speed_v2.patch added |
---|
Working patch (VERY SLOW), but it does cut down on the iterations significantly in many cases (I don't know where the slowdown is).
comment:1 by , 4 years ago
Is this still an issue?
I added a bunch of gpx tracks to josm, and then ran a validator.
I don't know any test that would care about gpx layers. Please give more details.
I don't know where the slowdown is
You can use VisualVM to find out.
comment:2 by , 4 years ago
I suspect it is still a problem, but it only ever shows up when a test looks for distance to nearest gpx point. I originally wrote this to help me quickly go through street level imagery to add missing information.
For example, if I'm adding surfaces or lanes to roads, I might have a test that looks something like this:
way[highway=~/^(motorway|trunk|primary|secondary|tertiary|residential|unclassified)$/][!surface][gpx_distance() < 30 || gpx_distance() >= 50000000], way[highway=~/^.*_link$/][!surface][gpx_distance() < 30 || gpx_distance() >= 50000000] { throwWarning: tr("Missing surface on a road"); } way[highway=~/^(service|living_street)$/][!surface][gpx_distance() < 30 || gpx_distance() >= 50000000] { throwOther: tr("Missing surface on a minor road"); }
(Yes, that is a test that I have used in the past).
I'll double check this, but the workaround was easy enough that I never was able to prioritize this bug. I'll see if I can work this in somewhere, just so this is actually fixed. I think this may have also have been ameliorated by reducing the number of gpx points in the generated track (we have a branch of the tasking manager which lets us create tasks based off of Mapillary sequences).
follow-up: 5 comment:3 by , 4 years ago
Ah, yes, looks like a lot of iterations. I have no idea why gpx_distance() is used here but that's a different problem.
VisualVM claims that most of the time is spent in org.openstreetmap.josm.data.osm.INode.getEastNorth()
. I wonder why this method is called instead of org.openstreetmap.josm.data.osm.Node.getEastNorth(Projecting projection)
which uses the cached value.
comment:4 by , 4 years ago
Yes, in fact, it seems to be the calculation of the eastNorth values for the node created from the gpx data. This calculation uses Math.atan() wich is very slow.
comment:5 by , 4 years ago
Replying to GerdP:
Ah, yes, looks like a lot of iterations. I have no idea why gpx_distance() is used here but that's a different problem.
VisualVM claims that most of the time is spent inorg.openstreetmap.josm.data.osm.INode.getEastNorth()
. I wonder why this method is called instead oforg.openstreetmap.josm.data.osm.Node.getEastNorth(Projecting projection)
which uses the cached value.
It looks like INode.getEastNorth
uses getEastNorth(ProjectionRegistry.getProjection())
. So it should be cached.
Yes, in fact, it seems to be the calculation of the eastNorth values for the node created from the gpx data. This calculation uses Math.atan() wich is very slow.
That makes sense. :(
by , 4 years ago
Attachment: | 17388.projection_cache.patch added |
---|
follow-up: 7 comment:6 by , 4 years ago
I tried something similar and it was even slower. It would be good if you could avoid the creation of the node instance.
comment:7 by , 4 years ago
Replying to GerdP:
I tried something similar and it was even slower. It would be good if you could avoid the creation of the node instance.
Fair enough. I'll still double check that, but it would be good to avoid the create of a node instance anyway. That was just (literally) first thing that came to mind when you indicated that the Node.getEastNorth
was getting hit a lot.
I'll make some modifications in Geometry next so that an eastnorth or latlon can be sent in. I'll have to refresh my memory of the code there first.
I'll also get a "base" runtime with the current state, the state with the patch I just added, and then whatever comes next. I think I'll upload both the gps file and the osm area to make things a bit easier to reproduce.
comment:8 by , 4 years ago
Patch | First Run | Second run |
No patch | 2h 57m 27s | N/A |
17388.projection_cache.patch | ||
17388.geometry_sqrt_cache.patch | 59m 39s | |
17388.geometry_sqrt_cache_valid.patch | 29m 25s |
I'll be updating this table as things finish running.
Notes:
- GPX layer has ~826000 points.
- First run of 17388.geometry_sqrt_cache_valid.patch had a debugger attached and was paused for some time
by , 4 years ago
Attachment: | 17388.geometry_sqrt_cache.patch added |
---|
Avoid new node, cache east north (only affects future runs), modify Geometry to avoid unnecessary Math.sqrt
comment:9 by , 4 years ago
@GerdP:
The documentation for EastNorth indicates that it is immutable ("This class is immutable"). Do we need to calculate the isValid
every time isValid
is called, or can we cache the result? Or better yet, calculate on creation, and store the boolean value at that time.
About the only reason I can think of to not cache the result or precalculate it would be memory usage. boolean
isn't well defined for size, and byte
s are 8 bits, so worst case scenario, each boolean
would increase memory usage by a byte.
The reason I'm asking is that most of the time is now spent in Double.isNaN
and Double.isInfinite
(both are used in EastNorth.isValid
).
TBH, Don-vip might be a better person to ask for that (he's the only one who has touched that function, as far as I can tell).
comment:10 by , 4 years ago
I don't think that memory is a problem if you talk about class EastNorth
. There should not be many instances.
Node has its own fields for the values.
by , 4 years ago
Attachment: | 17388.geometry_sqrt_cache_valid.patch added |
---|
Precalculate isValid for EastNorth, add new function that takes maxDistance for getLowestDistance in GpxDistance, modify BBox to have a bounds
method that takes a ILatLon and deprecate the method that uses LatLon, max gpx_distance as deprecated while adding one that takes a distance, avoid unnecessary Math.sqrt in Geometry
comment:11 by , 4 years ago
Your patch will decrease performance in other situations where isValid() is not used. I would simply cache the result.
I am still trying to understand what the value returned by gpx_distance() means in a mapcss rule. I have a bunch of gpx tracks from my cycle tours. I load them into JOSM and download an area near my home, so that many ways are "covered" by gpx data. Is that the scenario we are talking about? I don't see how the method detects if I cycled along an OSM way or if I just crossed it a few times.
comment:12 by , 4 years ago
I would simply cache the result.
Hmm, probably not that easy with an immutable class. Would a new class EastNorthValidated
solve this?
comment:13 by , 4 years ago
I wonder if getEastNorth() could/should be changed to return null instead of a new EastNorth
instance with invalid data. A null check is much easier to handle.
Edit: Wouldn't help. We have lots of methods which calculdate east/north values to create new EastNorth
instances.
by , 4 years ago
Attachment: | 17388.4.patch added |
---|
Avoid calculating validity until isValid
is called, at which point caching may occur.
by , 4 years ago
Attachment: | 17388.5.patch added |
---|
Same as attachment:17388.4.patch, except with two booleans
by , 4 years ago
Attachment: | 17388.6.patch added |
---|
Same as attachment:17388.4.patch, except with one Boolean
comment:15 by , 4 years ago
Replying to GerdP:
Why don't you use a
Boolean
?
Short answer: I am pretty certain that the number of EastNorth
objects can easily reach one million, so I'm attaching a few patches with different methods prior to running them under VisualVM.
Depending upon how the VM does things, I would expect 17388.6.patch to be the most efficient (they all refer to the same Boolean.TRUE
or Boolean.FALSE
instance, but then there is the size of the pointer to the Boolean
instances). If that is not the most memory efficient, I would then expect 17388.5.patch to be the most efficient, unless the size of boolean
is large, in which case, 17388.4.patch
would become the "most" efficient, which would be surprising. But we have a Java Bugs page for a reason.
follow-up: 17 comment:16 by , 4 years ago
See e.g. http://java-performance.info/overview-of-memory-saving-techniques-java/
It probably doesn't matter which version you use unless you add more fields to EastNorth
.
I am pretty certain that the number of EastNorth objects can easily reach one million
That would be bad. My understanding is that class EastNorth
is used to pass arguments between methods. Which classes allocate EastNorth
instances?
comment:17 by , 4 years ago
Replying to GerdP:
See e.g. http://java-performance.info/overview-of-memory-saving-techniques-java/
It probably doesn't matter which version you use unless you add more fields toEastNorth
.
I am pretty certain that the number of EastNorth objects can easily reach one million
That would be bad. My understanding is that class
EastNorth
is used to pass arguments between methods. Which classes allocateEastNorth
instances?
For whatever reason, I assumed that WayPoint/Node cached the created EastNorth value. Since they both create a new EastNorth value, I'm now wondering how the runtime was reduced.
Probable locations of runtime reduction:
- Avoiding
Math.sqrt
where possible - Using the method that gives a maximum distance to search
As far as your link goes:
Prefer primitive types to their Object wrappers. The main cause of wrapper types usage are JDK collections, so consider using one of primitive type collection frameworks like Trove.
and
Numeric wrappers occupy 12 bytes plus size of the underlying type. Byte/Short/Character/Integer/Long are cached by JDK, so the actual memory consumption may be smaller for values in the range [-128; 127]. Anyway, these type may be the source of serious memory overhead in the collection-based applications:
are the truly relevant sections. Boolean
is probably cached, but may or may not be.
In any case, without actual caching of the EastNorth
values, it becomes largely pointless to try to cache isValid
. Maybe weakreferences?
follow-up: 19 comment:18 by , 4 years ago
Sorry, I didn't try the last patches because I think that the calculation itself is rather useless.
I don't understand what the distance between an OSM way and the points of a collection of gpx tracks means. Instead of optimizing the details it might be better to avoid the computation when the result is not usable.
comment:19 by , 4 years ago
Replying to GerdP:
Sorry, I didn't try the last patches because I think that the calculation itself is rather useless.
I don't understand what the distance between an OSM way and the points of a collection of gpx tracks means. Instead of optimizing the details it might be better to avoid the computation when the result is not usable.
The actual code uses both gpx layers and geoimage layers. Its just that everything refers to Gpx. There is a faux GpxLayer created for GeoImage so that it can also be used with this function.
So a better example would be driving/biking around taking time-lapse pictures, and then using the example validator.mapcss to add surfaces/lanes to OSM.
EDIT: So, instead of using potentially outdated overhead, you can quickly check roads that may be visible in the photos.
comment:20 by , 4 years ago
OK, now I understand what this is about. Did not make the connection to pictures so far. So, each gpx point represents a picture which possible shows an area of x meters around the point and you try to calculate OSM ways which are within this radius. If many gpx points exist it might be faster to calculate once the area covered by the circles around the points and find ways which are intersecting this area (or the opposite). This could be done by calculating a small halo along the way like the renderer does and use something like java.awt.geom.Area.intersect(Area rhs)
. But these calculations are also very slow when the areas are complex, so you may have to use some simplifications to reduce the complexity of the polygons.
Anyway, I'll try 17388.6.patch with and without the changes in EastNorth now.
comment:21 by , 4 years ago
My results:
- Without the changes in
EastNorth
the performance is better. - The patch improves significantly the performance compared to unpatched version when using the rules from comment:2 (1.7 secs instead of 4.9 secs with my small test set)
I don't know how to use the new mapcss function. Please let me know the modified rules.
comment:22 by , 4 years ago
reg. the "enlarged bbox": see also code in UnconnectedWays.MyWaySegment (dist is given in meters):
BBox bounds = this.getBounds(dist * (360.0d / (Ellipsoid.WGS84.a * 2 * Math.PI)));
Something like this should work for a complete way:
double extraSpace = dist * (360.0d / (Ellipsoid.WGS84.a * 2 * Math.PI)); BBox bbox = new BBox(); bbox.addPrimitive(way, extraSpace);
Untested patch that adds a method to get the distance between the centroid of an
OsmPrimitive
and aBounds
, it also starts work on finding the closestBounds
.