
Opened 15 years ago

Closed 15 years ago

#3347 closed defect (fixed)

[partial patch] Merging nodes, fixing duplicate nodes, unglueing ways and splitting ways is very slow with large datasets

Reported by: anonymous
Owned by: team
Priority: major
Milestone:
Component: Core
Version: latest
Keywords:
Cc: Firefishy, bilbo

Description

I loaded OSM data from the attached file (originally it came from the server, where someone accidentally uploaded almost identical ways multiple times) and pressed the "Validate" button to start the validator. After a few seconds I got results, among them almost 1000 duplicate nodes. I selected the duplicated nodes in the validator and pressed "fix". Sixteen minutes later, JOSM was still busy fixing the nodes, with the CPU running at 100%.

With about 100 duplicated nodes, the validator fixes them much faster (a few seconds at most), so I guess there must be some algorithm with O(n²) or even worse complexity involved. The validator should be improved so that it can fix even a large number of duplicated nodes (1000 is not that large) in a reasonable time.

Attachments (5)

bad_kralupy_nobounds.osm.bz2 (60.0 KB ) - added by anonymous 15 years ago.
OSM dataset used to reproduce the bug
mergenodesstatic-main.patch (2.8 KB ) - added by bilbo 15 years ago.
Make some MergeNodesAction methods static for speedup - main part
mergenodesstatic-validator.patch (1.1 KB ) - added by bilbo 15 years ago.
Make some MergeNodesAction methods static for speedup - validator part
speedup-main.patch (968 bytes ) - added by bilbo 15 years ago.
Speedup patch part 1 (core)
speedup-plugin.patch (926 bytes ) - added by bilbo 15 years ago.
Speedup patch part 1 (validator)


Change History (19)

by anonymous, 15 years ago

OSM dataset used to reproduce the bug

comment:1 by anonymous, 15 years ago

From further testing, it seems the time is much worse when there are many ways. Fixing 600 duplicated nodes in areas with mostly only streets mapped is much faster than fixing 200 duplicated nodes in areas with more detail (not only streets but also the buildings along them, and thus several times more ways in the data).

In both cases the nodes to be deduplicated are not part of any way.

comment:2 by anonymous, 15 years ago

I looked at MergeNodesAction.java and I think I see the problem:

for (Relation r : getCurrentDataSet().relations) {
    ...
    for (RelationMember rm : r.getMembers()) {
        ...

For every node fixed, every member of every relation is walked. Luckily, there are usually only a few relations in a dataset, though they can sometimes contain a lot of members, especially some long routes.

for (Way w : getCurrentDataSet().ways) {
    ...
    for (Node pushNode : w.getNodes()) {
        ...

For every node fixed, every node in every way is walked.

And all of this is repeated for every duplicated node to be fixed. Ouch!

As a result: 88 ways with 611 nodes in them, plus another 1200 nodes outside of any way, and 262 duplicate node pairs, none of them in any way - fixing this takes about 30 seconds. Larger maps are much worse.

I think the membership data needs to be cached somehow for the MergeNodes action to be reasonably fast. With larger datasets, even manually merging a single pair of nodes is relatively slow (~1 sec) due to mergeNodes walking through all nodes.

Currently the complexity is O(number_of_nodes_to_be_fixed × number_of_nodes_in_data) ≈ O(n²). IMHO, with proper caching it could get down to O(number_of_nodes_to_be_fixed × log(number_of_nodes_in_data) + number_of_nodes_in_data) ≈ O(n log n) at worst (with most of the nodes being dupes), or O(n) with few nodes being dupes.

Such a cache could speed up some other things too: the Unglue action, splitting a way when only nodes are selected, and getWayForNode (used when drawing a line) would be considerably faster, especially with larger datasets (from "almost a second" to "unmeasurably fast").

The cache could contain, for each node, the list of ways and relations that contain it.

With the cache, the time to fix the nodes in the above example would probably go way under 1 second.
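A minimal sketch of such a node-to-ways back-reference cache could look like this. The `Node`/`Way` classes here are simplified stand-ins for JOSM's data classes, not the actual JOSM API; the point is that the cache is built once in O(total nodes in all ways), after which each membership lookup is O(1) amortized instead of a scan over every way:

```java
import java.util.*;

public class BackrefCacheSketch {
    // Simplified stand-ins for JOSM's Node/Way, for illustration only.
    static class Node { final long id; Node(long id) { this.id = id; } }
    static class Way { final List<Node> nodes = new ArrayList<>(); }

    // Build once: O(total nodes in all ways). Lookups afterwards avoid
    // walking every node of every way per merged node.
    static Map<Node, List<Way>> buildCache(Collection<Way> ways) {
        Map<Node, List<Way>> cache = new HashMap<>();
        for (Way w : ways) {
            for (Node n : w.nodes) {
                cache.computeIfAbsent(n, k -> new ArrayList<>()).add(w);
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        Node a = new Node(1), b = new Node(2), lonely = new Node(3);
        Way w = new Way();
        w.nodes.add(a);
        w.nodes.add(b);
        Map<Node, List<Way>> cache = buildCache(List.of(w));
        // A node outside any way is discovered immediately, without a scan.
        System.out.println("ways containing node 1: " + cache.getOrDefault(a, List.of()).size());
        System.out.println("ways containing node 3: " + cache.getOrDefault(lonely, List.of()).size());
    }
}
```

The cache must of course be invalidated or updated whenever a way's node list changes, which is the main cost of maintaining it.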

comment:3 by anonymous, 15 years ago

Component: Core validator → Core
Version: latest

comment:4 by anonymous, 15 years ago

Summary: Fixing duplicate nodes in validator is extremely slow → Merging nodes, unglueing ways and splitting ways is very slow with large datasets (as a result, fixing duplicate nodes in validator is unusably slow)

comment:5 by anonymous, 15 years ago

I found some further places with a "for all ways" loop - in CollectBackReferencesVisitor, which is called for example from computeNodesToDelete - this walks all nodes in all ways for each node of the deleted way. Very bad if you delete a way with many nodes in a large dataset.

comment:6 by Gubaer, 15 years ago

Ticket #3453 has been marked as a duplicate of this ticket.

comment:7 by Bürste, 15 years ago

Ticket #3285 has been marked as a duplicate of this ticket.

comment:8 by Firefishy, 15 years ago

Cc: Firefishy added

comment:9 by Gubaer, 15 years ago

Resolution: fixed
Status: new → closed

I've rewritten MergeNodesAction, see r2095.

The validator plugin is updated too. Batch processing of merge node actions is now more efficient and fixing errors is now an asynchronous cancelable task, see [o17585].

In my environment, 1000 duplicate nodes are fixed in ~90 secs.

comment:10 by bilbo, 15 years ago

Cc: bilbo added
Resolution: fixed
Status: closed → reopened
Summary: Merging nodes, unglueing ways and splitting ways is very slow with large datasets (as result, fixing duplicate nodes in validator is unusably slow. ) → [partial patch] Merging nodes, fixing duplicate nodes, unglueing ways and splitting ways is very slow with large datasets

I tried the new JOSM (2097) and the new validator (17585), and it is still slow, though now there is a progress bar, so you can see the progress.

I tried to download part of Prague from OSM (Nodes: 19587, Ways: 2373, Relations: 47), and run Validator.
The file I used for this test can be downloaded from http://git.wz.cz/dedup-testset4.osm.bz2 (408 KB)

I got 864 duplicated nodes, pressed "Fix", and the progress bar advanced at about 1 node per second. While maybe somewhat faster than before, it is still very slow. The nodes in the test dataset are not part of any way, so I'd expect a lookup into the backreferences to quickly discover that, and merging two nodes into one then shouldn't take that long.

Wouldn't it be better to have backreferences as part of the dataset and keep them updated as you edit? This would also speed up manually merging two nodes.

I tested it on an Athlon64 3000+; while not the best CPU you can have, it is not exactly some ancient piece of junk either.

I discovered that MergeNodesAction.mergeNodes needs about 1-2 msec.
When I timed DuplicateNode.fixError(), it needed about 11-12 msec, of which 10 msec were spent in "MergeNodesAction mergeAction = new MergeNodesAction();" - in the MergeNodesAction constructor I see it tries to re-register the shortcut for the action, etc.

This can be fixed by making some MergeNodesAction methods static, patch attached.
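The effect of that change can be sketched as follows. This is an illustration of the pattern, not the attached patch itself: constructing a Swing Action per fix pays for shortcut registration and menu wiring every time, while a static method carries no per-call construction cost:

```java
public class StaticMergeSketch {
    static int constructions = 0;

    // Before: each fix did `new MergeNodesAction()`; imagine this constructor
    // registering keyboard shortcuts, menu entries, icons, etc.
    StaticMergeSketch() { constructions++; }

    // After: the merge logic is a static method, so the expensive
    // constructor never runs in the fix loop.
    static long mergeNodes(long survivorId, long victimId) {
        return survivorId; // the surviving node's id; the victim is dropped
    }

    public static void main(String[] args) {
        for (int i = 0; i < 864; i++) {
            mergeNodes(1, 2); // no constructor invocation per fix
        }
        System.out.println("constructions: " + constructions);
    }
}
```

At ~10 msec of constructor overhead per fix, 864 fixes waste roughly 8-9 seconds on construction alone, so the static approach removes one measurable chunk of the slowdown.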

From this timing I'd expect node merging to take 1-2 seconds, not 15 minutes. It seems a lot of CPU time is still being wasted somewhere.

In ValidatorDialog$FixTask.realRun I found where all the CPU cycles are wasted:

SwingUtilities.invokeAndWait(
    new Runnable() {
        public void run() {
            Main.main.undoRedo.add(fixCommand);
        }
    }
);

This block needed 500 - 700 msec to run. The line "Main.main.undoRedo.add(fixCommand);" itself needed 150 - 300 msec.

Note that JOSM redraws the data after every fixed node, which probably contributes to the slowdown. When I zoom in very close, so the viewport shows only a small part of the map, fixing duplicate nodes is about 30-50% faster.

Also, when the nodes to be merged are selected (if you double-click the entire "Duplicate nodes" category in the validator, you select all "affected" nodes), merging is significantly slower. With the nodes unselected, it progresses approximately twice as fast.

by bilbo, 15 years ago

Attachment: mergenodesstatic-main.patch added

Make some MergeNodesAction methods static for speedup - main part

by bilbo, 15 years ago

Attachment: mergenodesstatic-validator.patch added

Make some MergeNodesAction methods static for speedup - validator part

by bilbo, 15 years ago

Attachment: speedup-main.patch added

Speedup patch part 1 (core)

by bilbo, 15 years ago

Attachment: speedup-plugin.patch added

Speedup patch part 1 (validator)

comment:11 by bilbo, 15 years ago

I discovered that undoRedo.add, called for each fix, applies the command and then does things like updating menus, redrawing, etc.

I modified it so that when running inside the validator, only the commands are applied, and the final part (updating menus, etc.) is run only once after all the fixes. See the two patches attached above.

With all 4 patches above applied (these speedups + making the MergeNodesAction calls static), the time needed to fix 864 duplicated nodes drops from about 10-15 minutes to about 5 seconds.
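The batching idea can be sketched like this. The `UndoRedo` class below is a hypothetical stand-in for JOSM's undo/redo handler, not its real API; the point is that the GUI refresh (menus, repaint) moves out of the per-command path and runs once per batch:

```java
import java.util.*;

public class BatchedUndoSketch {
    interface Command { void execute(); }

    // Hypothetical stand-in for JOSM's undo/redo handler.
    static class UndoRedo {
        final Deque<Command> stack = new ArrayDeque<>();
        int guiRefreshes = 0;

        // Per-command path: apply the command, then refresh the GUI
        // (menus, redraw) every single time - slow inside a fix loop.
        void add(Command c) {
            c.execute();
            stack.push(c);
            guiRefreshes++;
        }

        // Batched path: apply all commands first, refresh the GUI once.
        void addAll(List<Command> cmds) {
            for (Command c : cmds) {
                c.execute();
                stack.push(c);
            }
            guiRefreshes++;
        }
    }

    public static void main(String[] args) {
        UndoRedo ur = new UndoRedo();
        List<Command> fixes = new ArrayList<>();
        for (int i = 0; i < 864; i++) fixes.add(() -> { /* merge one pair */ });
        ur.addAll(fixes); // one refresh instead of 864
        System.out.println("commands: " + ur.stack.size() + ", refreshes: " + ur.guiRefreshes);
    }
}
```

At the 150-700 msec per refresh measured above, collapsing 864 refreshes into one accounts for essentially the whole 15-minutes-to-seconds improvement.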

comment:12 by Gubaer, 15 years ago

> Wouldn't it be better to have backreferences as part of the dataset and keep them updated as you edit? This would speed up also manually merging two nodes.

Absolutely. This is a kind of long-term goal. There has been a lot of incremental refactoring going on in the core OSM data classes lately. The ultimate goal is to better encapsulate data manipulation and to transparently provide additional data structures: a spatial index and back references.

Thanks for the patches. This is going to improve the situation tremendously. I'll apply them shortly.

comment:13 by Gubaer, 15 years ago

(In [2098]) see #3347: patches by singularita@…: Merging nodes, fixing duplicate nodes, unglueing ways and splitting ways is very slow with large datasets

comment:14 by Gubaer, 15 years ago

Resolution: fixed
Status: reopened → closed

See also [o17592].

My test file now runs in < 2 s. Great :-)
