lorenzo.viscanti@gmail.co *nix forums beginner
Joined: 17 Jul 2006
Posts: 1
|
Posted: Mon Jul 17, 2006 12:36 pm Post subject:
unicode html
|
|
|
Hi, I've found lots of material on the net about unicode html
conversions, but I'm still having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.
thanks,
lorenzo |
Sybren Stuvel *nix forums Guru
Joined: 22 Feb 2005
Posts: 550
|
Posted: Mon Jul 17, 2006 1:18 pm Post subject:
Re: unicode html
|
|
|
lorenzo.viscanti@gmail.com enlightened us with:
| Quote: | Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities.
|
Why would you do that? You can simply encode your HTML in Unicode, and
not bother with entities at all. Check out the HTML source of my
website http://www.stuvel.eu/. You'll see that I don't use entities at
all for the ü character.
| Quote: | As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.
|
Why would you want that? Just make sure you declare your document as
UTF-8, encode it as such, and you're done. Much easier.
Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa |
Gerard Flanagan *nix forums Guru Wannabe
Joined: 19 Sep 2005
Posts: 148
|
Posted: Mon Jul 17, 2006 3:07 pm Post subject:
Re: unicode html
|
|
|
lorenzo.viscanti@gmail.com wrote:
| Quote: | Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.
thanks,
lorenzo
|
no expertise with unicode issues but using 'pytextile' at the minute
which converts non-ascii to (numeric) html entities. It does something
like:
>>> s = unicode('\xe7', encoding='latin-1')
>>> s
u'\xe7'
>>> print s
ç
>>> print s.encode('ascii','xmlcharrefreplace')
&#231;
http://wiki.python.org/moin/PyTextile
hth
Gerard |
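For readers on Python 3, where every `str` is already Unicode, the session above might look roughly like the sketch below. The sample string is an assumption picked for illustration, not something from the thread.

```python
# Python 3 sketch: str is already Unicode, so no unicode() call is needed.
text = "fran\u00e7ais"  # sample input (an assumption, not from the thread)

# Encoding to ASCII with the 'xmlcharrefreplace' error handler swaps every
# non-ASCII character for a decimal numeric character reference.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# The result is pure ASCII, so decoding it back to str is always safe.
ascii_html = ascii_bytes.decode("ascii")

print(ascii_html)
```

ASCII characters, including markup such as `<` and `&`, pass through unchanged, so this is only suitable for text that is already valid HTML apart from its non-ASCII characters.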
Jim *nix forums Guru
Joined: 20 Feb 2005
Posts: 609
|
Posted: Mon Jul 17, 2006 6:31 pm Post subject:
Re: unicode html
|
|
|
Sybren Stuvel wrote:
| Quote: | lorenzo.viscanti@gmail.com enlightened us with:
| Quote: | As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.
|
Why would you want that? Just make sure you declare your document as
UTF-8, encode it as such, and you're done. Much easier.
|
For example, I am programming a script that makes html pages, but I do
not have the ability to change the "Content-Type .. charset=.." line
that is sent preceding those pages.
Jim |
Sybren Stuvel *nix forums Guru
Joined: 22 Feb 2005
Posts: 550
|
Posted: Mon Jul 17, 2006 6:38 pm Post subject:
Re: unicode html
|
|
|
Jim enlightened us with:
| Quote: | For example, I am programming a script that makes html pages, but I
do not have the ability to change the "Content-Type .. charset=.."
line that is sent preceding those pages.
|
"line"? Are you talking about the HTTP header? If it is wrong, it
should be corrected. If you are in control of the content, you should
also be in control of the Content-Type header. Otherwise, use a <meta>
tag that describes the content.
Sybren |
Jim *nix forums Guru
Joined: 20 Feb 2005
Posts: 609
|
Posted: Mon Jul 17, 2006 7:51 pm Post subject:
Re: unicode html
|
|
|
Sybren Stuvel wrote:
| Quote: | Jim enlightened us with:
| Quote: | For example, I am programming a script that makes html pages, but I
do not have the ability to change the "Content-Type .. charset=.."
line that is sent preceding those pages.
|
"line"? Are you talking about the HTTP header? If it is wrong, it
should be corrected. If you are in control of the content, you should
also be in control of the Content-Type header. Otherwise, use a <meta>
tag that describes the content.
|
Ah, but I cannot change it. It is not my machine and the folks who own
the machine perceive that the charset line that they use is the right
one for them. (Many people ship pages off this machine.)
Unfortunately, the <meta> tag idea also does not fly: see
http://www.w3.org/TR/html4/charset.html
in section 5.2.2 where it states that in a contest the charset
parameter wins.
My only point is that things are complicated and that there are times
when HTML entities are the answer (or anyway, an answer).
Jim |
Sybren Stuvel *nix forums Guru
Joined: 22 Feb 2005
Posts: 550
|
Posted: Mon Jul 17, 2006 8:50 pm Post subject:
Re: unicode html
|
|
|
Jim enlightened us with:
| Quote: | Ah, but I cannot change it. It is not my machine and the folks who
own the machine perceive that the charset line that they use is the
right one for them.
|
Well, _you_ are the one providing the content, aren't you?
| Quote: | (Many people ship pages off this machine.)
|
Sounds like they either don't know what they are talking about, or use
incompetent software. With Apache, it's very easy to give every
directory its own default character encoding header.
I assume that with "the charset parameter" you mean "the HTTP header",
as the <meta> tag also has a "charset parameter".
| Quote: | My only point is that things are complicated
|
Call me thick, but from my point of view they aren't. Everybody can
have their own default character encoding, as long as the software is
configured properly.
Sybren |
Jim *nix forums Guru
Joined: 20 Feb 2005
Posts: 609
|
Posted: Mon Jul 17, 2006 10:02 pm Post subject:
Re: unicode html
|
|
|
Sybren Stuvel wrote:
| Quote: | Jim enlightened us with:
| Quote: | Ah, but I cannot change it. It is not my machine and the folks who
own the machine perceive that the charset line that they use is the
right one for them.
|
Well, _you_ are the one providing the content, aren't you?
|
? This site has many people operating off of it (it is
sourceforge-like) and the operators (who are volunteers) are kind
enough to let us use it in the first place. I presume that they think
the charset line that they use is the one that most people want.
Probably if they changed it then someone else would complain.
| Quote: | Sounds like they either don't know what they are talking about, or use
incompetent software. With Apache, it's very easy to give every
directory its own default character encoding header.
|
I am operating under constraints. Asking the operators of the site has
led to the understanding that I must work with the charset parameter
that I have. That is, I have an environment in which I must work, and
whether you or I think the people providing the service should do it
differently doesn't matter. I replied originally because I thought I
could give an example of HTML entities providing a way to solve
the problem that is entirely under my control.
| Quote: | Unfortunately, the <meta> tag idea also does not fly: see
http://www.w3.org/TR/html4/charset.html in section 5.2.2 where it
states that in a contest the charset parameter wins.
|
| Quote: | I assume that with "the charset parameter" you mean "the HTTP header",
as the <meta> tag also has a "charset parameter".
|
AIUI "charset parameter" is the language of the HTML standard that I
referred to. For the meta tag, I at least would use "charset
attribute".
| Quote: | My only point is that things are complicated
|
| Quote: | Call me thick, but from my point of view they aren't.
|
Jim |
Damjan *nix forums Guru Wannabe
Joined: 24 Feb 2005
Posts: 226
|
Posted: Mon Jul 17, 2006 10:02 pm Post subject:
Re: unicode html
|
|
|
| Quote: | Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
|
'&#%d;' % ord(u'\u0430')
or
'&#x%x;' % ord(u'\u0430')
| Quote: | for all available html entities.
|
--
damjan |
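Since the original question asked for named entities such as `&ocirc;`, the one-liners above could be extended along these lines. This is only a sketch for Python 3: `to_html_entities` is a hypothetical helper name, and it relies on the standard library's `html.entities.codepoint2name` table, which covers the HTML 4 named entities.

```python
from html.entities import codepoint2name


def to_html_entities(text):
    """Replace non-ASCII characters with a named entity where HTML 4 defines
    one, falling back to a decimal numeric character reference otherwise."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # Plain ASCII (including markup such as < and &) passes through.
            parts.append(ch)
        elif code in codepoint2name:
            parts.append("&%s;" % codepoint2name[code])
        else:
            parts.append("&#%d;" % code)
    return "".join(parts)


print(to_html_entities("c\u00f4te \u0430"))
```

Here U+00F4 has a named entity while the Cyrillic U+0430 does not, so the second character falls back to the numeric form.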
Stefan Behnel *nix forums addict
Joined: 18 Apr 2005
Posts: 81
|
Posted: Tue Jul 18, 2006 6:43 am Post subject:
Re: unicode html
|
|
|
lorenzo.viscanti@gmail.com wrote:
| Quote: | Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.
|
I don't know how you generate your HTML, but ElementTree and lxml both have
good HTML parsers, so that you can let them write out the result with an
"US-ASCII" encoding and they will generate numeric entities for everything
that's not ASCII.
>>> from lxml import etree
>>> root = etree.HTML(my_html_data)
>>> html_7_bit = etree.tostring(root, encoding="us-ascii")
Stefan |
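The same idea can be tried without third-party packages, because the standard library's `xml.etree.ElementTree` also writes numeric character references when asked to serialize to US-ASCII. A minimal Python 3 sketch, which builds a tiny tree by hand instead of parsing real HTML (the element and its text are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Build a one-element tree; a real program would obtain this from a parser.
p = ET.Element("p")
p.text = "c\u00f4te"

# Serializing to US-ASCII forces every non-ASCII character to be written
# as a numeric character reference. The result is a bytes object.
html_7_bit = ET.tostring(p, encoding="us-ascii", method="html")

print(html_7_bit)
```

Unlike a plain string encode, the serializer also escapes `&`, `<` and `>` in text content, which may or may not be what a given page generator wants.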
Sybren Stuvel *nix forums Guru
Joined: 22 Feb 2005
Posts: 550
|
Posted: Tue Jul 18, 2006 7:37 am Post subject:
Re: unicode html
|
|
|
Jim enlightened us with:
| Quote: | AIUI "charset parameter" is the language of the HTML standard that I
referred to. For the meta tag, I at least would use "charset
attribute".
|
Then I'm at a loss as to what you actually mean. You might be confusing
HTTP with HTML.
Sorry I couldn't help you further.
Sybren |
Duncan Booth *nix forums Guru
Joined: 11 Mar 2005
Posts: 422
|
Posted: Tue Jul 18, 2006 7:57 am Post subject:
Re: unicode html
|
|
|
wrote:
| Quote: | As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.
|
>>> u"\u3cB4".encode('ascii','xmlcharrefreplace')
'&#15540;'
Don't bother using named entities. If you encode your unicode as ascii
replacing all non-ascii characters with the xml entity reference then your
pages will display fine whatever encoding is specified in the HTTP headers. |
Sybren Stuvel *nix forums Guru
Joined: 22 Feb 2005
Posts: 550
|
Posted: Tue Jul 18, 2006 3:04 pm Post subject:
Re: unicode html
|
|
|
Duncan Booth enlightened us with:
| Quote: | Don't bother using named entities. If you encode your unicode as
ascii replacing all non-ascii characters with the xml entity
reference then your pages will display fine whatever encoding is
specified in the HTTP headers.
|
Which means OP can't use Unicode/UTF-8 entity references, since that's
not specified in the HTTP header.
Sybren |
Duncan Booth *nix forums Guru
Joined: 11 Mar 2005
Posts: 422
|
Posted: Tue Jul 18, 2006 3:21 pm Post subject:
Re: unicode html
|
|
|
Sybren Stuvel wrote:
| Quote: | Duncan Booth enlightened us with:
| Quote: | Don't bother using named entities. If you encode your unicode as
ascii replacing all non-ascii characters with the xml entity
reference then your pages will display fine whatever encoding is
specified in the HTTP headers.
|
Which means OP can't use Unicode/UTF-8 entity references, since that's
not specified in the HTTP header.
|
That doesn't matter, character references are not affected by the network
encoding.
From http://www.w3.org/TR/html4/charset.html#h-5.3.1
| Quote: | 5.3.1 Numeric character references
Numeric character references specify the code position of a character
in the document character set.
|
The character references use the *document character set*, which is
independent of the character encoding used for network transmission. This
is defined for HTML as ISO10646, and (section 5.1) "The character set
defined in [ISO10646] is character-by-character equivalent to Unicode
([UNICODE])". |
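The point about the document character set can be checked with a short Python 3 sketch. It assumes the declared charset is ASCII-compatible (the three names below are examples, not taken from the thread) and uses the standard library's `html.unescape` as a stand-in for what a browser does with character references.

```python
import html

original = "c\u00f4te \u3cb4"

# Pure-ASCII bytes with numeric character references for everything else.
ascii_bytes = original.encode("ascii", "xmlcharrefreplace")

# Simulate servers declaring different charsets: ASCII-only bytes decode to
# the same text under each of these ASCII-compatible encodings.
decoded = {cs: ascii_bytes.decode(cs) for cs in ("utf-8", "iso-8859-1", "cp1252")}

# Resolving the references yields the original characters every time,
# because they name Unicode code points, not bytes in the declared charset.
recovered = {cs: html.unescape(s) for cs, s in decoded.items()}

print(ascii_bytes)
print(all(text == original for text in recovered.values()))
```

The sketch would not carry over to a transport encoding that is not ASCII-compatible, such as UTF-16, where the bytes themselves would have to be re-encoded.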
Sybren Stuvel *nix forums Guru
Joined: 22 Feb 2005
Posts: 550
|
Posted: Tue Jul 18, 2006 5:21 pm Post subject:
Re: unicode html
|
|
|
Duncan Booth enlightened us with:
| Quote: | The character references use the *document character set*, which is
independent of the character encoding used for network transmission.
This is defined for HTML as ISO10646, and (section 5.1) "The
character set defined in [ISO10646] is character-by-character
equivalent to Unicode ([UNICODE])".
|
I didn't know that. Thanks for the lecture :)
Sybren |