
Opened 8 days ago

Closed 8 days ago

Last modified 5 hours ago

#24266 closed defect (othersoftware)

josm.openstreetmap.de server slow

Reported by: Stereo
Owned by: stoecker
Priority: major
Milestone:
Component: Trac
Version: latest
Keywords:
Cc:

Description

Since yesterday, the JOSM server has been slow to respond. I'm getting timeouts in both my web browser and JOSM itself (MOTD, imagery...). https://downforeveryoneorjustme.com/josm.openstreetmap.de?proto=https shows intermittent outages.

Attachments (0)

Change History (13)

comment:1 by stoecker, 8 days ago

Resolution: othersoftware
Status: new → closed

See top of wiki:StartupPage.

comment:2 by Stereo, 8 days ago

The OSMF might be able to help with proxies and CDNs, if you're interested? Do you still have my email?

comment:3 by stoecker, 8 days ago

Until there is a real need, I don't want a much more complicated infrastructure. Some of the effects are also caused by updating the server, migrating the database, improving the setup, increasing performance and so on. Since the service was unstable anyway, I took the liberty of doing all the pending maintenance ;-)

Some of today's troubles were also the result of relaxing the IP blocks for bad actors - it seems some of them didn't learn anything.

Currently it's looking manageable at a rate of approx. 1 million requests a day, which is much better than 4 days ago, when it was twice as much. I also block some access patterns which wasted a LOT of resources for nothing.

The server should now be able to deliver at least twice as many requests as before. And a lot of counter-measures are automated now.

Cases like this also make the JOSM server a bit of a testing system for me to learn how to better cope with such situations, which is extremely helpful in my work. I hope I'll never have such issues with our customer systems, but the JOSM server helps me be prepared.

When I see what's possible with a 60€ server, and how much a lot of people and companies pay for far less performance in the end...

P.S. If you know a Postgres expert who could help me with all the details of connection numbers, properly setting up a pooling daemon like pgbouncer, and the overall performance benefits and disadvantages of such a system, I'd like to hear about it. Or ways to convince Apache to accept many more connections without overloading the database (i.e. pooling again). But I already know a lot about that, so I'd need a REAL expert :-)
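
For context, roughly what such a pooling setup looks like - a minimal pgbouncer sketch; the database name, auth file and limits below are placeholders, not this server's actual values:

[databases]
; hypothetical database name, pointing pgbouncer at the local Postgres
trac = host=127.0.0.1 port=5432 dbname=trac

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: many client connections share few server connections
pool_mode = transaction
; roughly match Apache's worker count
max_client_conn = 1000
; actual Postgres connections opened per database/user pair
default_pool_size = 20

The web application would then connect to port 6432 instead of 5432, so the many short-lived web connections get funneled into a small, steady set of Postgres backends.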

comment:4 by stoecker, 8 days ago

P.S. For all those now bursting with curiosity: the JOSM server

  • normally transmits more than 4TB data each month
  • until last year had about 90k visitors a month, who accessed 300k pages and caused about 9 million requests
  • this year numbers increased to 350k visitors, 700k pages and 15 million requests
  • April 2025 (i.e. half a month only) has 2 million visitors, 8 million pages and 12 million requests

Until recently I reacted to slowness of the server by blocking bad actors (usually whole network segments, which I don't like much because of the collateral damage). Typically these could be traced to AI companies that forget all rules about how to behave when collecting data. I'm not opposed to the data scraping they do, only to the method they do it with. Hey guys: rather than hammering one server with dozens of requests at the same time, how about changing your bots to access dozens of servers with one request each? Google and the others learned that; you can also learn to be nice and still get your data.

The increased number of visitors in April shows that either a LOT of new actors came up or site scraping is now using a botnet. Access comes from all over the world. There is no longer any chance of filtering that by IP. I had to improve the whole setup.

Last edited 8 days ago by stoecker (previous) (diff)

comment:5 by stoecker, 6 days ago

Hmm, is it only me or does the server react faster than before the trouble started (even though the load is still higher than it was)?

comment:6 by GerdP, 6 days ago

It was fast maybe an hour ago, now it is very slow again.

in reply to:  6 comment:7 by stoecker, 6 days ago

Replying to GerdP:

It was fast maybe an hour ago, now it is very slow again.

Whoa. ATM it's a wonder you get an answer at all ;-)

Don't know whether it's broken scanning or a mini DDoS, but currently all Apache connection slots are full. And that is 10 times more than 3 days ago.

Last edited 6 days ago by stoecker (previous) (diff)

comment:8 by Firefishy, 2 days ago

Here are a few tips:

Ensure you have Apache logging the per-request timing (%D, written as %Dus in the format below, i.e. microseconds), e.g.:

LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %Dus %{SSL_PROTOCOL}x %{SSL_CIPHER}x" combined_extended
CustomLog /var/log/apache2/trac.openstreetmap.de-access.log combined_extended

Then use gawk to rank client IPs by the total time their requests took, e.g.:

tail -n 1000000 /var/log/apache2/trac.openstreetmap.de-access.log | gawk '{ match($0,/ ([0-9]+)us /,m) } { sum[$1] += m[1] } END { for(ip in sum) print sum[ip],ip }'|sort -nr | head -n 50

What is the system's load average? Is memory usage high? Which Apache MPM is being used?

What are the values of the following directives? (An example of where these typically live follows the list.)

MinSpareThreads
MaxSpareThreads
ThreadLimit
ThreadsPerChild
MaxRequestWorkers
MaxConnectionsPerChild
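
For reference, a sketch of where these typically live with the event MPM (e.g. in mods-available/mpm_event.conf on Debian-style layouts) - the values below are placeholders, not this server's actual settings:

<IfModule mpm_event_module>
    # placeholder values for illustration only
    MinSpareThreads          75
    MaxSpareThreads         250
    ThreadLimit              64
    ThreadsPerChild          64
    MaxRequestWorkers      1024
    MaxConnectionsPerChild    0
</IfModule>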

Any chance I could see the server-info stats? (a2enmod info)

<IfModule mod_info.c>
<Location /server-info>
    SetHandler server-info
    Require ip 2001:8b0:deeb:b120::/64
    Require ip 127.0.1.1
    Require ip 127.0.0.1
    Require ip ::1
</Location>
</IfModule>

I am on #osm-dev on https://irc.openstreetmap.org/, happy to help if desired.

comment:9 by stoecker, 2 days ago

When the server is slow, it currently has all 1024 parallel request slots in use, at about 30 requests/s, which adds up to 1.5-2 million requests a day. As the target systems are really slow, the slots stay blocked for a few seconds while data is being sent.

If I increase the number of parallel slots (which would still be possible load-wise), I simply get more requests. Everything I provide is used up...

I can block the target URLs to make the server faster, but that will affect normal users as well, as I can't define:

  • bad URLs
  • bad access patterns
  • bad IPs
  • bad user agents

It's simply normal traffic, only way too much of it.

E.g. a few minutes ago I simply returned 503 for two URL patterns and essentially got rid of the load, but then those two URLs won't work anymore.

A small detail that gives you an indication of why most methods of coping with high load fail: when I check the last 10000 connections, I see 8000 different IPs.
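
For reference, a quick way to reproduce that count from the access log (path assumed from the LogFormat example above; the client IP is the first field):

tail -n 10000 /var/log/apache2/trac.openstreetmap.de-access.log | awk '{print $1}' | sort -u | wc -l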

Any chance I could see the server info stats

No. There are simply too many sensitive details in there. The server is running the Apache worker MPM on a 64GB-memory system with 8 cores.

What I'm currently looking for is a URL-pattern limiter: do not accept more than XXX requests to URL pattern YYY, and return 503 for any further requests. That would keep the server fast for all other requests without blocking the specific URLs completely. I didn't find anything like that. If the load does not go down again and I don't find a solution for temporary rejects, I'll have to write my own Apache module for that.
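
For what it's worth, one module that at least advertises this kind of per-URL concurrency cap is mod_qos; a minimal sketch, assuming the module is installed - the URL patterns and the limit are placeholders, and this has not been verified against the setup described here:

<IfModule mod_qos.c>
    # answer blocked requests with 503 instead of mod_qos's default 500
    QS_ErrorResponseCode 503
    # allow at most 50 concurrent requests to URLs matching this placeholder pattern
    QS_LocRequestLimitMatch "^/(expensive-one|expensive-two)" 50
</IfModule>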

in reply to:  9 ; comment:10 by Firefishy, 2 days ago

Replying to stoecker:

In cases when the server is slow the server currently has all 1024 parallel requests

1024 parallel requests seems high for a system with 8 cores. Apache isn't very efficient. haproxy or nginx could likely handle that level of parallelism and might be a good option for offloading Apache.
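
Purely as an illustration of the offloading idea, a minimal sketch of nginx terminating the many slow client connections and proxying to Apache on a local port; hostnames, ports and certificate paths are placeholders:

server {
    listen 443 ssl;
    server_name josm.openstreetmap.de;

    # placeholder certificate paths
    ssl_certificate     /etc/ssl/certs/josm.pem;
    ssl_certificate_key /etc/ssl/private/josm.key;

    location / {
        # Apache assumed to be moved to a local port; nginx holds the many
        # slow client connections, Apache only sees a few fast local ones
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}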

What I'm currently searching would be a URL-pattern limiter: Do not accept more than XXX requests to url-pattern YYY, return 503 for any more requests. That would keep the server fast for all other requests without blocking the specific URLs completely. Didn't find anything like that. If this does not go down again or I find a solution for temporary rejects I'll have to write my own Apache module for that.

You can do this with mod_security.

# Init collections
SecAction "id:900000,phase:1,nolog,pass,initcol:ip=%{REMOTE_ADDR}"

# Only trigger for /api/expensive: deny once the per-IP counter exceeds 5
# (id, phase and the disruptive action belong on the chain starter)
SecRule REQUEST_URI "@beginsWith /api/expensive" \
    "id:900001,phase:1,deny,status:429,msg:'Rate limit exceeded for /api/expensive',chain"
    SecRule IP:EXPENSIVE_COUNTER "@gt 5"

# Increment counter and expire in 10 seconds
SecRule REQUEST_URI "@beginsWith /api/expensive" "phase:1,pass,nolog,id:900002,\
    setvar:ip.EXPENSIVE_COUNTER=+1,expirevar:ip.EXPENSIVE_COUNTER=10"

Or by IPv4 /24 subnet:

SecAction "id:900010,phase:1,nolog,pass,setvar:tx.SUBNET=%{REMOTE_ADDR}"
SecRule TX:SUBNET "@rx ^(\d+\.\d+\.\d+)\." "phase:1,nolog,pass,setvar:tx.SUBNET_PREFIX=%{tx.1}"
SecAction "phase:1,nolog,pass,initcol:ip=%{tx.SUBNET_PREFIX}"

in reply to:  10 comment:11 by stoecker, 30 hours ago

Replying to Firefishy:

Replying to stoecker:

In cases when the server is slow the server currently has all 1024 parallel requests

1024 parallel requests seems high for a system with 8 cores. Apache isn't very efficient. haproxy or nginx could likely handle that parallel level and might be a good option for offloading apache.

Nah, that's fine. Apache has improved a lot in recent years...

You can do this with mod_security.

Couldn't get this to work with mod_security (see mails).

I found a solution by extending my Apache watchdog script. I'd still prefer a fine-grained solution inside Apache, but ATM the current solution seems to work.

If the site scrapers overdo it, they now often get 503 errors, which slows them down (Apache can deliver a really large number of 503 errors in a short time :-). This keeps reaction times for everything else in a much more acceptable range (for now).
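
The actual watchdog isn't shown in this ticket; purely as an illustration of the general idea (and of the hysteresis described in comment 13 below), a minimal sketch with hypothetical paths, URL patterns and thresholds:

#!/bin/sh
# Hypothetical sketch of a log-driven 503 throttle - NOT the actual watchdog.
LOG=/var/log/apache2/trac.openstreetmap.de-access.log
FLAG=/var/run/apache-throttle.flag
PATTERN='GET /(changeset|timeline)'
BLOCK_AT=2000
UNBLOCK_AT=500

# Count hits on the (placeholder) expensive URL patterns in the recent log window.
hits=$(tail -n 20000 "$LOG" | grep -Ec "$PATTERN")

# Hysteresis: throttling only starts well above the level at which it stops again.
if [ "$hits" -ge "$BLOCK_AT" ] && [ ! -e "$FLAG" ]; then
    touch "$FLAG"
elif [ "$hits" -le "$UNBLOCK_AT" ] && [ -e "$FLAG" ]; then
    rm -f "$FLAG"
fi

The Apache side would then answer the expensive patterns with 503 only while the flag file exists:

RewriteEngine on
RewriteCond /var/run/apache-throttle.flag -f
RewriteRule ^/(changeset|timeline) - [R=503,L]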

comment:12 by stoecker, 24 hours ago

Seems the new automatic method works. ;-)

The rejection rate went up to 300 requests/s, then there was a phase where it seems the scrapers probed what the server would permit, and now the requests no longer exceed a reasonable count.

Hope it stays like this.

comment:13 by stoecker, 5 hours ago

It's now been one day since the new setup, and it seems to protect the server reasonably well against longer waiting times for legitimate traffic. You'll still notice the high-load phases, but they no longer seem to cause as many timeouts as before.

I know for sure now that the site scrapers have an automatic feedback system, as they react to what the server does.

But it seems they can't handle my hysteresis approach: blocking kicks in at a much higher load level than unblocking. The optimum from the scrapers' side would be to stay a little below my block level. Instead they always try to increase access and then have to fall back to the lower level to get unblocked. Since they can't cope with that, it seems hysteresis is uncommon for such systems, which I don't understand - it's an obvious approach to me. Or maybe the scraper script developers are a bit dumb; that could also explain the whole problem ;-)
