Opened 11 years ago

Closed 8 years ago

#8491 closed defect (fixed)

WP-Articles containing ampersand ("&") can not be opened

Reported by: jjaf.de
Owned by: simon04
Priority: critical
Milestone:
Component: Plugin wikipedia
Version: tested
Keywords: ampersand URL htmlspecialchars
Cc:

Description

Wikipedia (WP) articles whose titles contain an ampersand ("&"), such as Sternheim & Emanuel, are composed as Sternheim_&amp;_Emanuel, which leads to a Wikimedia error page that misleads the user.
So the encoding used is HTML, but it should be URL encoding?!

Attachments (1)

8491_workaround.patch (4.4 KB ) - added by simon04 10 years ago.

Change History (11)

comment:1 by simon04, 11 years ago

Owner: changed from simon04 to jjaf.de
Status: new → needinfo

Are you using the wikipedia plugin or some external tools in combination with remote control?

Please give a sequence of steps to reproduce the problem.

comment:2 by jjaf.de, 11 years ago

Locality

The bug relates to wikipedia/29435, not the external tools and remote control.

Steps to reproduce

=> You do not get to the WP article as expected, but to http://de.wikipedia.org/wiki/Sternheim_&amp;_Emanuel
Instead of 'Artikel öffnen' ("Open article") you can also select the way and choose 'Merkmal hinzufügen' ("Add tag"), which creates an invalid 'wikipedia' tag.

comment:3 by Don-vip, 11 years ago

Owner: changed from jjaf.de to simon04
Status: needinfo → new

comment:4 by skyper, 11 years ago

Priority: normal → critical

It does not work at all with "&". False result for existing tags, invalid tag added and wrong path to page.

Raised the priority because of the invalid tag.

comment:5 by mdk, 10 years ago

I'm a little bit confused: the '&amp;' in http://de.wikipedia.org/wiki/Sternheim_&amp;_Emanuel looks like XML encoding, but we are dealing with a URL. The URL encoding for '&' is '%26', and http://de.wikipedia.org/wiki/Sternheim_%26_Emanuel leads to the correct page.
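For illustration, a minimal Python sketch of the encoding difference described in this comment. It is not the plugin's code (the plugin is written in Java); the helper name `article_url` and the decision to undo HTML escaping before percent-encoding are assumptions made for the example.

```python
import html
from urllib.parse import quote


def article_url(lang: str, title: str) -> str:
    """Build a Wikipedia article URL from a possibly HTML-escaped title.

    Hypothetical helper for illustration only.
    """
    # Undo HTML/XML escaping such as "&amp;" -> "&".
    plain = html.unescape(title)
    # MediaWiki treats "_" as an alias for the space in page names.
    plain = plain.replace(" ", "_")
    # Percent-encode the UTF-8 bytes; "&" becomes "%26", "/" is left as-is.
    return f"https://{lang}.wikipedia.org/wiki/{quote(plain)}"


print(article_url("de", "Sternheim &amp; Emanuel"))
# prints https://de.wikipedia.org/wiki/Sternheim_%26_Emanuel
```

The same result is produced for an unescaped title such as "Sternheim & Emanuel", so the cleanup step is harmless when the input is already plain text.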

comment:6 by simon04, 10 years ago

I contacted the author of http://toolserver.org/~kolossos/geoworld/ since I think the error is already contained in the sent file. I'll await his response; otherwise we can apply a workaround.

by simon04, 10 years ago

Attachment: 8491_workaround.patch added

comment:7 by mdk, 9 years ago (in reply to: comment:6)

Replying to simon04:

I contacted the author of http://toolserver.org/~kolossos/geoworld/ since I think the error is already contained in the sent file. I'll await his response; otherwise we can apply a workaround.

Did you get an answer in the last 15 months? ;-)

comment:9 by verdy_p, 9 years ago

You should have a look at which characters MediaWiki accepts as valid in page names, even if they are HTML-encoded in standard HTML.
A doc page on MediaWiki.org gives a lot of details about the various encodings used (including in URLs to wikis):
https://www.mediawiki.org/wiki/Manual:PAGENAMEE_encoding
Basically, a valid pagename can contain any characters except controls (including tabs and newlines) and # < > [ ] _ { | }
The underscore is not supported but is in fact used as an alias for the space (and spaces are trimmed, except the first one if it occurs
All supported characters are encoded as UTF-8 and then URL-encoded (not HTML-encoded).

So the same bug will also apply to articles containing single and double quotation marks (which must be URL-encoded as %27 and %22).

The encoding to use is shown and demonstrated by the "PAGENAMEE" row in the table.

But you can also safely use the row of that table showing the "{{urlencode:...|PATH}}" variant. The two differ for a few ASCII characters:

  • The tilde is URL-encoded as %7E, but this is not required and it can be left verbatim as "~".
  • The ASCII space is also URL-encoded as "%20" or "+", but it can also be encoded as "_", which is the preferred shorter form.
  • The question mark "?" is also valid in page names but must be URL-encoded as %3F.
  • All non-ASCII characters must normally be URL-encoded according to their UTF-8 encoding (e.g. "%C2%A0" for NBSP, which is also valid in page names, but NOT "&nbsp;" or "&amp;nbsp;" or "%27nbsp;" or other similar inventions using sharp signs "#", which are forbidden in URL paths!). However, MediaWiki servers also locate pages if you request them without URL-encoding the non-ASCII characters, provided that the underlying web server software (an Apache module or the CGI extension running PHP) is 8-bit clean. This is the case for all Wikipedia servers on their official domains, except for a few unofficial mirrors on other domains.
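The per-character rules in the list above can be checked with a small Python sketch. It uses the standard library's `urllib.parse.quote` as a stand-in for the encoding and is not MediaWiki's own implementation; the function name `encode_char` is an assumption made for the example.

```python
from urllib.parse import quote


def encode_char(ch: str) -> str:
    """Percent-encode a single character from its UTF-8 bytes (illustrative helper)."""
    # safe="" means no reserved character is exempt; letters, digits
    # and "_.-~" are never quoted by urllib (Python 3.7 or later for "~").
    return quote(ch, safe="")


# Characters mentioned in the comment: quotes, question mark, tilde, NBSP.
for ch in ["'", '"', "?", "~", "\u00a0"]:
    print(repr(ch), "->", encode_char(ch))
```

With Python 3.7 or later this yields %27, %22, %3F, an unchanged "~", and %C2%A0 respectively, matching the values given in the comment.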

Note also that this encoding is DIFFERENT for anchors (the part after a sharp sign "#" in wikilinks): it uses the encoding exposed by the {{anchorencode:...}} parser function instead.

The anchor is not part of the HTTP "GET" or "STAT" request but is handled on the client side, or transmitted as metadata. Sometimes a link to Wikipedia will bring you to a redirected page using a specific anchor, and you have to get this metadata to interpret the link correctly once the server gives you back the content of the redirect target and not the requested page.
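As a rough illustration of the point that the anchor is not sent to the server, the following Python sketch splits a URL into the part that goes into the request and the fragment that stays on the client. The function name `request_target` and the page name "Beispielstadt" are hypothetical.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> tuple[str, str]:
    """Return (path-and-query sent to the server, fragment kept by the client)."""
    parts = urlsplit(url)
    target = parts.path
    if parts.query:
        target += "?" + parts.query
    # The fragment (anchor) is never part of the request line.
    return target, parts.fragment


# Hypothetical article and section, for illustration only.
print(request_target("https://de.wikipedia.org/wiki/Beispielstadt#Ortsteil"))
# prints ('/wiki/Beispielstadt', 'Ortsteil')
```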

This case happens in Wikipedia when some old municipalities (or smaller entities) have merged and their articles in Wikipedia have been merged into the same article (about a larger entity), via a redirect using anchors to a section of the merged article.

The anchor encoding in MediaWiki is specific to MediaWiki: there is no standard in HTML or HTTP(S) about this. HTML gives some hints about the recommended character sets supported and only standardizes "id" attributes, not other anchors such as those using "name" attributes; it does not indicate how to map arbitrary valid strings to valid anchors.

MediaWiki defines its constraints based on the way it generates anchors for valid section names. A valid section name can be any string; it is only trimmed, and its letter case is significant. The encoding is also derived from URL-encoding, but uses "." instead of "%". If distinct sections have the same heading, anchor encoding generates the same default anchor for them, so these anchors are not unique. Articles should then add specific anchors using id="" attributes on some HTML tag, or use enhanced headings so that all section headings in the page are distinct. However, MediaWiki may also generate unique ids by appending some unpredictable numeric suffix to the generated ids, or generate no id at all for these sections for strict XML conformance; in that case it is not possible to link to a valid anchor for the appropriate section unless it is the first one in the page.
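The "." instead of "%" scheme described above can be sketched in Python as follows. This only mirrors the rule as stated in this comment (trim, spaces to underscores, URL-encode, then swap "%" for "."); it is not MediaWiki's actual implementation, newer MediaWiki versions may generate ids differently, and the name `legacy_anchor` is an assumption.

```python
from urllib.parse import quote


def legacy_anchor(section: str) -> str:
    """Approximate the dot-style anchor encoding described in the comment."""
    # Section names are trimmed; letter case is kept as-is.
    name = section.strip().replace(" ", "_")
    # URL-encode the UTF-8 bytes, then use "." where URL-encoding uses "%".
    return quote(name, safe=":").replace("%", ".")


print(legacy_anchor("Ortsteile & Gemeinden"))  # prints Ortsteile_.26_Gemeinden
print(legacy_anchor("Größe"))                  # prints Gr.C3.B6.C3.9Fe
```

Two sections with the same heading produce the same string here, which is exactly the non-uniqueness problem the comment points out.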

comment:10 by simon04, 8 years ago

Resolution: fixed
Status: new → closed

https://github.com/aiomaster/WIWOSM/issues/10 still hasn't been resolved.

Switched to MediaWiki API to query by coordinates in [o32606].
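For readers unfamiliar with coordinate queries, here is a hedged Python sketch of how such a request URL could be built. It assumes the `list=geosearch` module of the MediaWiki Action API (provided by the GeoData extension); the ticket does not say which module or parameters the plugin actually uses, and the function name, radius, and limit values are illustrative.

```python
from urllib.parse import urlencode


def geosearch_url(lang: str, lat: float, lon: float,
                  radius_m: int = 1000, limit: int = 10) -> str:
    """Build a geosearch query URL (assumed API module, illustrative defaults)."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",   # "lat|lon"; urlencode escapes "|" as %7C
        "gsradius": radius_m,        # search radius in metres
        "gslimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


print(geosearch_url("de", 52.5, 13.4))
```

Titles returned in the JSON response are plain Unicode strings, so a client can percent-encode them itself rather than relying on pre-escaped text from a third-party service.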
