#24266 closed defect (othersoftware)
josm.openstreetmap.de server slow
| Reported by: | Stereo | Owned by: | stoecker |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | Trac | Version: | latest |
| Keywords: | | Cc: | |
Description
Since yesterday, the JOSM server has been slow to respond. I'm getting timeouts in both my web browser and JOSM itself (MOTD, imagery...). https://downforeveryoneorjustme.com/josm.openstreetmap.de?proto=https shows intermittent outages.
Attachments (0)
Change History (13)
comment:1 by , 8 days ago
| Resolution: | → othersoftware |
|---|---|
| Status: | new → closed |
comment:2 by , 8 days ago
The OSMF might be able to help with proxies and CDNs, if you’re interested? You still have my email?
comment:3 by , 8 days ago
Until there is a real need I don't want a much more complicated infrastructure. Some of the effects are also caused by updating the server, migrating the database, improving the setup, increasing performance and so on. Since the service was unstable anyway, I took the liberty of doing all the pending maintenance ;-)
Some of today's trouble was also a result of relaxing the IP blocks for bad actors - it seems some of them didn't learn anything.
Currently it's looking manageable, at a rate of approx. 1 million requests a day, which is much better than 4 days ago with twice as many. I also block some access patterns which wasted a LOT of resources for nothing.
The server should now be able to deliver at least twice as many requests as before. And a lot of counter-measures are automated now.
Such cases with the JOSM server also serve as a bit of a testing system for me to learn how to better cope with such situations, which is extremely helpful in my work. I hope I'll never have such issues with our customer systems, but the JOSM server helps me to be prepared.
When I see what's possible with a 60€ server, and how a lot of people and companies pay much more for much less final performance...
P.S. If you know a Postgres expert who could help me with all the details of connection numbers, properly setting up a pooling daemon like pgbouncer, and the overall performance benefits and disadvantages of such a system, I'd like to know. Or ways to convince Apache to accept many more connections without overloading the database (i.e. pooling again). But I already know a lot about that, so I'd need a REAL expert :-)
comment:4 by , 8 days ago
P.S. For all those now bursting with curiosity: the JOSM server
- normally transmits more than 4TB data each month
- until last year had about 90k visitors a month which accessed 300k pages and caused about 9 million requests
- this year numbers increased to 350k visitors, 700k pages and 15 million requests
- April 2025 (i.e. half a month only) has 2 million visitors, 8 million pages and 12 million requests
Until recently I reacted to slowness of the server by blocking bad actors (usually whole network segments, which I don't like much because of the collateral damage). Typically these could be traced to AI companies which forget all the rules of good behaviour when getting data. I'm not opposed to the data scraping they do, only to the method they use. Hey guys: rather than hammering one server with dozens of requests at the same time, how about changing your bots to access dozens of servers with one request each? Google and the others learned that; you too can learn to be nice and still get your data.
The increased number of visitors in April shows that either a LOT of new actors have appeared or the site scraping is now using a botnet. Access comes from all over the world. There is no chance anymore to filter that by IP. I had to improve the whole setup.
comment:5 by , 6 days ago
Hmm, is it only me or does the server react faster than before the trouble started (even though the load is still higher than it was)?
comment:7 by , 6 days ago
Replying to GerdP:
It was fast maybe an hour ago, now it is very slow again.
Whoa. ATM it's a wonder you get an answer at all ;-)
Don't know whether it's broken scanning or a mini DDoS, but currently all Apache connection slots are full. And that's 10 times as many slots as 3 days ago.
comment:8 by , 2 days ago
Here are a few tips:
Ensure you have Apache logging the per-request timing (%Dus), e.g.:
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %Dus %{SSL_PROTOCOL}x %{SSL_CIPHER}x" combined_extended
CustomLog /var/log/apache2/trac.openstreetmap.de-access.log combined_extended
Then use gawk to rank client IPs by total request time, e.g.:
tail -n 1000000 /var/log/apache2/trac.openstreetmap.de-access.log | gawk '{ match($0,/ ([0-9]+)us /,m) } { sum[$1] += m[1] } END { for(ip in sum) print sum[ip],ip }' | sort -nr | head -n 50
What is the system's load average? High memory usage? What apache MPM is being used?
What are the values of:
MinSpareThreads MaxSpareThreads ThreadLimit ThreadsPerChild MaxRequestWorkers MaxConnectionsPerChild
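A quick way to pull those answers from the command line - a sketch assuming a Debian-style Apache layout (paths are illustrative):
# Which MPM is in use, plus current load average and memory usage
apachectl -V | grep -i 'Server MPM'
uptime
free -m
# Where the worker/event tuning directives are set, and to which values
grep -RinE 'MinSpareThreads|MaxSpareThreads|ThreadLimit|ThreadsPerChild|MaxRequestWorkers|MaxConnectionsPerChild' /etc/apache2/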
Any chance I could see the server-info stats? (a2enmod info)
<IfModule mod_info.c>
    <Location /server-info>
        SetHandler server-info
        Require ip 2001:8b0:deeb:b120::/64
        Require ip 127.0.1.1
        Require ip 127.0.0.1
        Require ip ::1
    </Location>
</IfModule>
I am on #osm-dev on https://irc.openstreetmap.org/ - happy to help if desired.
follow-up: 10 comment:9 by , 2 days ago
When the server is slow, it currently has all 1024 parallel request slots in use, at about 30 requests/s, going up to 1.5-2 million requests a day. As the receiving systems are really slow, each slot stays blocked for a few seconds while data is sent.
When I increase the number of parallel slots (which is still possible load-wise), I simply get more requests. Everything I provide gets used up...
I can block the targeted URLs to make the server faster, but that will affect normal users as well, as I can't identify
- bad URLs
- bad access patterns
- bad IPs
- bad user agents
It's simply normal traffic, only way too much of it.
E.g. a few minutes ago I simply returned 503 for two URL patterns and essentially got rid of the load, but then these two URLs won't work anymore.
A small detail giving you an indication why most methods to cope with high load fail: When I check the last 10000 connections I have 8000 different IPs.
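For the record, a check like that can be done straight from the access log - a sketch, assuming the log path and combined format from comment:8:
# Count distinct client IPs among the last 10000 requests
tail -n 10000 /var/log/apache2/trac.openstreetmap.de-access.log | awk '{print $1}' | sort -u | wc -l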
Any chance I could see the server info stats
No. There are simply too many sensitive details in there. The server is running the Apache worker MPM on a 64GB memory system with 8 cores.
What I'm currently looking for is a URL-pattern limiter: do not accept more than XXX requests to URL pattern YYY, return 503 for any further requests. That would keep the server fast for all other requests without blocking the specific URLs completely. I didn't find anything like that. Unless this dies down again or I find a solution for temporary rejects, I'll have to write my own Apache module for that.
follow-up: 11 comment:10 by , 2 days ago
Replying to stoecker:
When the server is slow, it currently has all 1024 parallel request slots in use
1024 parallel requests seems high for a system with 8 cores. Apache isn't very efficient. haproxy or nginx could likely handle that parallel level and might be a good option for offloading apache.
What I'm currently looking for is a URL-pattern limiter: do not accept more than XXX requests to URL pattern YYY, return 503 for any further requests. That would keep the server fast for all other requests without blocking the specific URLs completely. I didn't find anything like that. Unless this dies down again or I find a solution for temporary rejects, I'll have to write my own Apache module for that.
You can do this with mod_security.
# Init per-IP collection
SecAction "id:900000,phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR}"

# Only trigger for /api/expensive: deny once the per-IP counter exceeds 5
SecRule REQUEST_URI "@beginsWith /api/expensive" \
    "id:900001,phase:1,deny,status:429,msg:'Rate limit exceeded for /api/expensive',chain"
    SecRule IP:EXPENSIVE_COUNTER "@gt 5"

# Increment counter and expire it in 10 seconds
SecRule REQUEST_URI "@beginsWith /api/expensive" \
    "id:900002,phase:1,pass,nolog,\
    setvar:ip.EXPENSIVE_COUNTER=+1,expirevar:ip.EXPENSIVE_COUNTER=10"
Or by IPv4 /24 subnet:
SecAction "id:900010,phase:1,nolog,pass,setvar:tx.SUBNET=%{REMOTE_ADDR}" SecRule TX:SUBNET "@rx ^(\d+\.\d+\.\d+)\." "phase:1,nolog,pass,setvar:tx.SUBNET_PREFIX=%{tx.1}" SecAction "phase:1,nolog,pass,initcol:ip=%{tx.SUBNET_PREFIX}"
comment:11 by , 28 hours ago
Replying to Firefishy:
Replying to stoecker:
When the server is slow, it currently has all 1024 parallel request slots in use
1024 parallel requests seems high for a system with 8 cores. Apache isn't very efficient. haproxy or nginx could likely handle that parallel level and might be a good option for offloading apache.
Nah, that's fine. Apache has improved a lot in recent years...
You can do this with mod_security.
Couldn't get this to work with mod_security (see mails).
I found a solution by extending my Apache watchdog script. I'd still prefer a fine-grained solution inside Apache, but ATM the current one seems to work.
If the site scrapers overdo it, they often get 503 errors, which slows them down (Apache can deliver a LOT of 503 errors in a short time :-). This keeps reaction times for everything else in a much more acceptable range (for now).
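(Sketch for illustration only - this is not the actual watchdog script; log path, URL pattern, thresholds and the Debian-style conf-enabled/ layout are assumptions. It shows the general idea, including the hysteresis described in comment:13 below.)
#!/bin/bash
# Hypothetical watchdog sketch: throttle one URL pattern with 503s when it
# dominates recent traffic, with hysteresis (blocking needs a much higher
# level than unblocking).
LOG=/var/log/apache2/trac.openstreetmap.de-access.log
PATTERN='^/hypothetical/expensive'                      # URL pattern to protect (placeholder)
BLOCK_CONF=/etc/apache2/conf-enabled/throttle-503.conf  # extra config, dropped in / removed as needed
BLOCK_AT=800     # matching requests among the last 20000 log lines that trigger blocking
UNBLOCK_AT=200   # unblock only well below the block level (hysteresis)

while sleep 60; do
    # Crude load proxy: requests matching the pattern among the most recent log lines
    # (the request path is field 7 in the combined log format).
    hits=$(tail -n 20000 "$LOG" | awk -v pat="$PATTERN" '$7 ~ pat' | wc -l)

    if [ "$hits" -ge "$BLOCK_AT" ] && [ ! -e "$BLOCK_CONF" ]; then
        # Answer matching requests with 503 until the load drops off again;
        # the 503 responses still appear in the log, so the counter keeps working.
        echo "RedirectMatch 503 \"$PATTERN\"" > "$BLOCK_CONF"
        apachectl graceful
    elif [ "$hits" -le "$UNBLOCK_AT" ] && [ -e "$BLOCK_CONF" ]; then
        rm -f "$BLOCK_CONF"
        apachectl graceful
    fi
done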
comment:12 by , 22 hours ago
Seems the new automatic method works. ;-)
The rejection rate went up to 300 requests/s, then there was a phase where, it seems, they tested what the server would permit, and now the requests no longer exceed a reasonable count.
Hope it stays like this.
comment:13 by , 4 hours ago
It's now been one day since the new setup, and it seems to protect the server well enough to prevent longer waiting times for legitimate traffic. You'll still notice the high-load phases, but they no longer seem to cause as many timeouts as before.
I now know for sure that the site scrapers have an automatic feedback system, as they react to what the server does.
But it seems they can't handle my hysteresis approach: blocking kicks in at a much higher load level than unblocking. The optimum from the scrapers' side would be to stay a little below my block level, but instead they always try to increase access and then have to fall back to the lower level to get unblocked. As they can't cope with that, it seems hysteresis is uncommon for such systems, which I don't understand - it's an obvious approach to me. Or maybe the scraper script developers are a bit dumb; that could also explain the whole problem ;-)
See top of wiki:StartupPage.