Modify

Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#16734 closed defect (fixed)

JOSM stuck in event loop for multiple seconds

Reported by: stephankn Owned by: team
Priority: normal Milestone: 18.09
Component: Core imagery Version: latest
Keywords: cache performance Cc: wiktorn

Description

JOSM is stuck in the event loop for multiple seconds. See the attached YourKit profiler snapshop and the long events report in the threads tab. For size reasons only the last block of 8s has call stacks.

I can reproduce this by panning to a fresh area, then waiting a while (I suspect 300s to remove threads). When I then change the Zoom level and Pan, then the GUI is frozen for quite long.

I suspect this is more significant as I have a large map window on a 4K screen and due to a fast internet connection configured max 12 concurrent tasks per host, 64 total. Increasing the total max might make it worse.

If I get the stack right, we are inside the painting loop in
org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer.drawInViewArea(Graphics2D, MapView, ProjectionBounds) AbstractTileSourceLayer.java:1514

What is the reason that here tiles are loaded and not simply painted? If tiles are available, they can be painted. If tiles are due to be loaded this is something to expected to take longer, so why not offloading this to a later point in time?

Or maybe it was intended to do so? The (before mentioned) JCSCachedTileLoaderJob.submit() might do this. But according to the profiler this method then takes away 71% of the time in the painting loop. In my case multiple seconds.

Later on a lot of time is spent here:
HostLimitQueue.java:143 <...> java.util.concurrent.ThreadPoolExecutor.setCorePoolSize(int)

which seems to be part of a "trick" to exploit a side effect of the method. Looks like this was implemented by wiktorn.

I believe when the block happens I see a lot of thread creations and destructions happen in a short time. Doing this synchronous in the painting loop seems to cause the delay.

The following snippet os a call stack of what I believe a single call.

"Name","Level"
"AWT-EventQueue-0  Runnable CPU usage on sample: 234ms","1"
"java.util.concurrent.ThreadPoolExecutor.setCorePoolSize(int)","2"
"org.openstreetmap.josm.data.cache.HostLimitQueue.offer(Runnable) HostLimitQueue.java:143","2"
"org.openstreetmap.josm.data.cache.HostLimitQueue.offer(Object) HostLimitQueue.java:30","2"
"java.util.concurrent.ThreadPoolExecutor.execute(Runnable)","2"
"org.openstreetmap.josm.data.cache.JCSCachedTileLoaderJob.submit(ICachedLoaderListener, boolean) JCSCachedTileLoaderJob.java:168","2"
"org.openstreetmap.josm.data.imagery.TMSCachedTileLoaderJob.submit(boolean) TMSCachedTileLoaderJob.java:150","2"
"org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer.loadTile(Tile, boolean) AbstractTileSourceLayer.java:891","2"
"org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer.access$100(AbstractTileSourceLayer, Tile, boolean) AbstractTileSourceLayer.java:141","2"
"org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer$TileSet.loadAllTiles(boolean) AbstractTileSourceLayer.java:1272","2"
"org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer$TileSet.access$1000(AbstractTileSourceLayer$TileSet, boolean) AbstractTileSourceLayer.java:1183","2"
"org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer.drawInViewArea(Graphics2D, MapView, ProjectionBounds) AbstractTileSourceLayer.java:1514","2"
"org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer.access$2700(AbstractTileSourceLayer, Graphics2D, MapView, ProjectionBounds) AbstractTileSourceLayer.java:141","2"
"org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer$TileSourcePainter.doPaint(MapViewGraphics) AbstractTileSourceLayer.java:1892","2"
"org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer$TileSourcePainter.paint(MapViewGraphics) AbstractTileSourceLayer.java:1886","2"
"org.openstreetmap.josm.gui.MapView.paintLayer(Layer, Graphics2D) MapView.java:468","2"
"org.openstreetmap.josm.gui.MapView.drawMapContent(Graphics2D) MapView.java:583","2"
"org.openstreetmap.josm.gui.MapView.paint(Graphics) MapView.java:490","2"
"java.awt.EventDispatchThread.run()","2"


by removing "HostLimitQueue.offer()" and setting executor.setCorePoolSize(maximumPoolSize) in setExecutor it improved significantly, but I am quite confident this is no solution to this issue, just a hint to confirm the hypothesis that the dynamic thread handling blocks.

Attachments (4)

josm-custom-2018-09-11(7).zip (7.9 MB) - added by stephankn 2 years ago.
YourKit profiler snapshot
josm-corepoolsize.patch (6.6 KB) - added by wiktorn 2 years ago.
josm-custom-2018-09-20(1).zip (15.2 MB) - added by stephankn 2 years ago.
josm-custom-2018-09-30(6).7z (8.2 MB) - added by stephankn 2 years ago.
profiler snapshot as mentioned before

Change History (29)

Changed 2 years ago by stephankn

YourKit profiler snapshot

comment:1 Changed 2 years ago by Don-vip

Cc: wiktorn added
Component: CoreCore imagery
Keywords: cache performance added

comment:2 Changed 2 years ago by Don-vip

Can you please attach your status report too?

comment:4 Changed 2 years ago by wiktorn

Generally, raising concurrent tasks and number of them per host makes sense when you have high bandwidth but high latency link. If you have good Internet connection it should not be necessary. If the problem is that server (for example WMS) is taking long to respond, you will put this way a lot of stress on it.

Removing the hack in offer means, that we will never discard threads from the pool. But if we would call allowCoreThreadTimeOut(true) on our ThreadPoolExecutor then we could set corePoolSize = maxPoolSize and we would get desired behaviour (removing unused threads after keepAlive, spawning new threads when needed).

The problem with setCorePoolSize hack is that it acquires the lock on whole ThreadPool and then walks through all threads trying to interrupt them. This might be the bottleneck.

comment:5 Changed 2 years ago by wiktorn

Implemented such patch, but there is one drawback with such approach - until we reach max concurrent tasks in total, we will not invoke our HostLimitQueue at all, as it is only used when there is no thread available in core pool.

If we set core pool to 0, and max pool to max concurrent tasks, then we will start creating extra workers only, if queue will refuse to accept the job. And as we use infinite queues, it will never happen.

@stephankn: can you please try attached patch? It uses bounded queues with DiscardOldest policy. My quick tests show that it might be something that we want and it may also address your issue.

Changed 2 years ago by wiktorn

Attachment: josm-corepoolsize.patch added

comment:6 Changed 2 years ago by stephankn

@wiktorn: thanks for the patch. I will check tomorrow evening as today I already have some other obligations.

How many resources are actually used by threads in the pool? Is it worth the effort and complexity of dynamically adjusting the size? The threads are used for loading tiles, right? So why not increase the pool by a static number when imagery layers are added (max connections per host (plus maybe two spare ones?)) and remove them when imagery layer is removed.

comment:7 Changed 2 years ago by wiktorn

We try not to overload tile servers or internet connection. We mimic web browsers in this behaviour and configuration. As one server may serve many tiles, it's better to trace this globally.

While checking this I noticed that the only place where it's not done is WMS.

Anyway - I hope that removal of setCorePoolSize hack will solve your problem.

comment:8 Changed 2 years ago by stephankn

@wiktorn: I applied the patch and can confirm that the previously observed longer blocks are all gone.

I can still see some sporadic blocks of the painting, but these seem unrelated. Another strange 14s block happened once, but I can't yet figure out how exactly the paint loop was blocked.
I will keep an eye on it and in case I have more details I'll create a different ticket.

comment:9 Changed 2 years ago by wiktorn

After some pondering about possible solutions here, I see three ways to go:

  1. Stay with ThreadPoolExecutor and host aware queue - it means that some of jobs submitted run under the radar of the limit (it is only enforced when task enters the queue. If there is thread readily available to execute the task - nobody will check the host limit)
  2. Change implementation to ThreadPoolExecutor with beforeExecute / afterExecute. Then host limit will be checked always, though threads will be blocked waiting for free slot instead of finding another job to execute. We could avoid that by moving the task to the end of queue after short timeout waiting on lock, though this would randomize the order in which we download tiles
  3. Change the limit model to specific number of threads per layer and do no further inspections. Assume that layer is provided by one "host" and that multiple layers point to different hosts
  4. Duplicate ThreadPoolExecutor and make necessary changes so the tasks always goes first through the queue (meaning - addWorker always takes "empty task" for a start, we start a new thread only when queue is not empty and less than max workers are running, execute always puts task in the queue)

I personally lean towards 3. solution for the sake of simplicity. 4th seams attractive and correct but I fear code duplication and subtle bugs due to different behaviour, locks etc.

@Don-vip - what do you think?

comment:10 Changed 2 years ago by stoecker

Solution 3 makes the idea of switch keyword in URLs useless.

comment:11 Changed 2 years ago by stephankn

@Dirk: why? It will still switch between different hosts, which could end up on different end-points on load balancers or so. Even before: Why using the switch at all? It was used in browsers to overcome the old hardcoded limit of two concurrent connections per host. a.tile, b.tile, ... all pointed to the same host as alias. Just tricked the browser into thinking of different hosts. As JOSM is no browser it did not have such a limit at all. Also not before this patch. It did build connection up to the configured max limit. Modern browsers have much larger limits. Here they state for most browsers 6 connection, some up to 13: https://docs.pushtechnology.com/cloud/latest/manual/html/designguide/solution/support/connection_limitations.html

So using the switch trick was unneeded in JOSM already. Simply specify 24 max connections instead four hosts with six each.

comment:12 in reply to:  9 Changed 2 years ago by Don-vip

Replying to wiktorn:

@Don-vip - what do you think?

I'm not sure to fully understand neither the problem nor the solutions. So I have no idea what's the best one. My advice would be to opt for the simplest and safer as long as it does not break any existing feature.

comment:13 in reply to:  11 ; Changed 2 years ago by Don-vip

Replying to stephankn:

@Why using the switch at all?

Don't we simply use this for load balancing among different servers?

comment:14 in reply to:  13 Changed 2 years ago by wiktorn

Replying to Don-vip:

Replying to wiktorn:

@Don-vip - what do you think?

I'm not sure to fully understand neither the problem nor the solutions. So I have no idea what's the best one. My advice would be to opt for the simplest and safer as long as it does not break any existing feature.

The case is - either we remove host limit feature or we go with ThreadPoolExecutor reimplementation which may be a bit tricky.

It's worth noting, that our current implementation may generate more connections to the host than configured maximum.

Replying to Don-vip:

Replying to stephankn:

@Why using the switch at all?

Don't we simply use this for load balancing among different servers?

Actually, load balancing is done on different layer (at least - for OSM servers) - DNS entries. {a,b,c}.tile.openstreetmap.org all point to tile.geo.openstreetmap.org which then chooses the closest region, and then - provides few IP addresses for this region. What it hints to clients, is that if you have 3 hosts in switch, you are 3x more capable than others in handling connections.

If we will ditch host based limit, then we will treat all of them the same. I'm not sure how much valid it will be in HTTP/2 world.

OTOH - there are tile servers that, compared to OSM infrastructure, are far less capable and maybe clients should be more gentle with them by default.

comment:15 in reply to:  11 ; Changed 2 years ago by stoecker

Replying to stephankn:

@Dirk: why? It will still switch between different hosts, which could end up on different end-points on load balancers or so. Even before: Why using the switch at all? It was used in browsers to overcome the old hardcoded limit of two concurrent connections per host. a.tile, b.tile, ... all pointed to the same host as alias. Just tricked the browser into thinking of different hosts. As JOSM is no browser it did not have such a limit at all. Also not before this patch. It did build connection up to the configured max limit. Modern browsers have much larger limits. Here they state for most browsers 6 connection, some up to 13: https://docs.pushtechnology.com/cloud/latest/manual/html/designguide/solution/support/connection_limitations.html

So using the switch trick was unneeded in JOSM already. Simply specify 24 max connections instead four hosts with six each.

Yes. But that would affect all others as well. I always assumed that these host which specify more names also support more load.

Is there an easy solution which may exceed the Host limit, but not too often? I think we can live with exceeding the limit when it is an exception and happens not too often.

comment:16 in reply to:  15 Changed 2 years ago by wiktorn

Replying to stoecker:

Yes. But that would affect all others as well. I always assumed that these host which specify more names also support more load.

Is there an easy solution which may exceed the Host limit, but not too often? I think we can live with exceeding the limit when it is an exception and happens not too often.

I've double checked ThreadPoolExecutor implementation. If we set corePoolSize == maximumPoolSize then only during "warm-up" the limits will not be honoured. If we would share the pool also for WMS layers, then we would not spawn too many threads.

I'll test that in the evening and if it works as expected, I'll commit this approach. We can then check if it is enough.

comment:17 Changed 2 years ago by wiktorn

In 14261/josm:

Use corePoolSize = maxPoolSize and unbounded queue

Remove hack with calling setCorePoolSize to spawn new threads when already
corePoolSize is running. For bigger sizes of the ThreadPool setCorePoolSize took
a long time and blocked EDT thread. Now we will have always `maximum concurren
downloads` threads running for TMS layers, but this is not a big issue, as the
thread pool is shared between imagery layers.

As long as number of threads will not reach the maximum concurrent downloads
the maximum concurrent downloads per host limit will not be enforced (first n
downloads).

See: #16734

comment:18 Changed 2 years ago by wiktorn

I did not change behaviour for WMS layers, where we have per layer ThreadPool of the size simultaneous connections and a maximum concurrent downloads per host limit. As there is usually one host with WMS, maximum concurrent downloads per host might kick in earlier.

My questions:

  1. Should we have a global limit on download requests of WMS layers (as we have for TMS)?
  2. Should we have per-host limit?
  3. Should above settings be different than for TMS?

comment:19 Changed 2 years ago by stoecker

You decide. You know probably more about this topic than anybody else. I (and Vincent) can give an opinion, but it can be flawed due to missing knowledge (and the feel) of details.

comment:20 Changed 2 years ago by wiktorn

In 14262/josm:

Fix WMS simultaneus connections

When creating download thread pool set the maximum downloads per host to the
same value as simultaneus connections.

See: #16734

comment:21 Changed 2 years ago by wiktorn

Milestone: 18.09
Resolution: fixed
Status: newclosed

comment:22 Changed 2 years ago by stephankn

I updated to 14262. With that version I have again the long blocks I could no longer reproduce with the first patch.

I attach a profiler snapshot showing multiple delays, the one at the end being 10s with a tiny interruption.

This time it seems to block here:

org.openstreetmap.josm.gui.layer.AbstractTileSourceLayer$TileSet.visitTiles(Consumer, Consumer) AbstractTileSourceLayer.java:1297
       public void visitTiles(Consumer<Tile> visitor, Consumer<TilePosition> missed) {
            tilePositions().parallel().forEach(tp -> visitTilePosition(visitor, tp, missed));
        }

How independent are the worker threads in the pool? JCS Downloading xxx threads are largely blocked.

Changed 2 years ago by stephankn

comment:23 Changed 2 years ago by stephankn

maybe this comment regarding blocking parallel workers due to long-running tasks is related? https://dzone.com/articles/think-twice-using-java-8

edit: I created #16760 as it might be a new issue which now popped up as one bottle-neck was removed.

Last edited 2 years ago by stephankn (previous) (diff)

comment:24 in reply to:  22 Changed 2 years ago by wiktorn

Replying to stephankn:

I updated to 14262. With that version I have again the long blocks I could no longer reproduce with the first patch.

This is interesting. The only thing that could affect performance is the change in [14260]. You may take a try and revert it and check if this helps.

How independent are the worker threads in the pool? JCS Downloading xxx threads are largely blocked.

JCS threds are usually blocked, because they do the heavy lifting - downloading from network and also loading images into memory (BufferedImages). Though they are completely unrelated to EDT thread, which is responsible for GUI responsiveness.

Can you tell:

  • did your experience your problem on earlier (before [14262]) JOSM version when you had default values for concurrent connections?
  • do you experience this problem, when you lower maximum number of connections (total/per host)
  • do you experience this problem, when you set "Tile zoom offset" to -2 (this will reduce number of tiles shown on your screen)

Though I don't have 4k monitor, I tried setting multiple connections to high value and set "Tile zoom offset" to +2 so I'd get also a lot of tiles to manage, but it didn't trigger the bug for me.

It might be worthwhile to rewrite visitTiles to use specific ThreadPool instead of generic one as this is called from EDT and we can't allow this part to be blocked by other tasks (though - I didn't notice such situation in your profiler snapshots).

comment:25 Changed 2 years ago by anonymous

I had a bit time to work on reproduction again.

previously I had 14247 and experienced the delays reported here. With the patches applied on top of it it was much better.

With version 14262 I had again delays. With the changes of 14260 I still face delays. I currently can't quantify whether the revert of 14260 improved it or not. I will have to switch back and forth between the versions a bit. Maybe I can capture some snapshots in the profiler to see objective numbers.

Current settings is 64 connections parallel, max 12 per host. It felt like lower numbers for concurrent tasks make the hang less likely, but this could be coincidence as well. It happens frequently, but I have no cool "automated" way to reproduce it always the same way.

Reproduction is:
1) Opening empty data layer
2) adding DG imagery layer
3) place a node, then move the mouse to see the way follow the cursor. Zoom/Pan and move the mouse. Hangs are easy visible.

I tried version (14262 without 14260) with the offset -2 setting: I still got quite easy a 2.5s Event block.
Tried the opposite setting, +2, just to see if it changes anything: Again getting hangs. Maybe a bit more, but certainly not longer delays each. Map shows a lot of "flickering" of tiles.

Going for another test down to 4 threads per host, 8 max, offset again at 0:
I was not able to reproduce any event block long enough to be reported by the profiler.

So certainly related somehow to thread management.

Going to the other extreme: max threads 256, 64 per host: I have quite frequent and long blocks.
Individual ones are in the range of 2.5s, but they come back-to-back. If you are interested in the snapshot, let me know. Snapshot 6 has a phase of nearly 20s of blocks, split up into individual one.
I am not certain how to read the monitor usage time in yourKit correctly. My impression is that they always count from the start of the program. Clearing telemetry doesn't seem to reset the times.

As you mentioned visitTiles. The one it might conflict with could be StyledMapRenderer.paintWithLock(), which also ends up with ForkJoinPool.

edit: snapshot (6) attached.

Last edited 2 years ago by stephankn (previous) (diff)

Changed 2 years ago by stephankn

profiler snapshot as mentioned before

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain team.
as The resolution will be set.
The resolution will be deleted.

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.