unicode html
lorenzo.viscanti@gmail.co
*nix forums beginner


Joined: 17 Jul 2006
Posts: 1

Posted: Mon Jul 17, 2006 12:36 pm    Post subject: unicode html

Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.

thanks,
lorenzo
Sybren Stuvel
*nix forums Guru


Joined: 22 Feb 2005
Posts: 550

Posted: Mon Jul 17, 2006 1:18 pm    Post subject: Re: unicode html

lorenzo.viscanti@gmail.com enlightened us with:
Quote:
Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities.

Why would you do that? You can simply encode your HTML in Unicode, and
not bother with entities at all. Check out the HTML source of my
website http://www.stuvel.eu/. You'll see that I don't use entities at
all for the ü character.

Quote:
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.

Why would you want that? Just make sure you declare your document as
UTF-8, encode it as such, and you're done. Much easier.
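
A minimal sketch of that approach in Python 3 (the snippets elsewhere in this thread are Python 2), assuming you control both the markup and the way the bytes are written. The `build_page` helper and the sample text are made up for illustration:

```python
import html

def build_page(body_text):
    """Return a small HTML page as UTF-8 bytes (hypothetical helper)."""
    markup = (
        '<html><head>\n'
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        '<title>demo</title>\n'
        '</head><body><p>%s</p></body></html>\n'
    ) % html.escape(body_text)  # escapes &, < and > but leaves non-ASCII alone
    # The declaration only helps if the bytes really are UTF-8.
    return markup.encode('utf-8')

page = build_page('h\u00f4tel')  # U+00F4 goes out as the two bytes C3 B4
```

No entities are involved here; the page is only correct if the server's Content-Type header agrees with (or at least does not contradict) the declared charset.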

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
Gerard Flanagan
*nix forums Guru Wannabe


Joined: 19 Sep 2005
Posts: 148

Posted: Mon Jul 17, 2006 3:07 pm    Post subject: Re: unicode html

lorenzo.viscanti@gmail.com wrote:
Quote:
Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.

thanks,
lorenzo

No expertise with unicode issues, but I'm using 'pytextile' at the minute,
which converts non-ASCII to (numeric) HTML entities. It does something
like:

>>> s = unicode('\xe7', encoding='latin-1')
>>> s
u'\xe7'
>>> print s
ç
>>> print s.encode('ascii', 'xmlcharrefreplace')
&#231;
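
For anyone on Python 3, a rough equivalent might look like the sketch below. `str` is already Unicode there, so the `unicode()` call goes away; the sample word is just an illustration:

```python
# Python 3: encode to ASCII, replacing anything non-ASCII with a
# numeric character reference.
s = 'fa\u00e7ade'                                  # contains U+00E7 (ç)
encoded = s.encode('ascii', 'xmlcharrefreplace')   # bytes, pure ASCII
text = encoded.decode('ascii')                     # back to str if needed
print(text)
```

Note that `encode` returns `bytes` in Python 3, hence the extra `decode('ascii')` when a `str` is wanted.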



http://wiki.python.org/moin/PyTextile


hth

Gerard
Jim
*nix forums Guru


Joined: 20 Feb 2005
Posts: 609

Posted: Mon Jul 17, 2006 6:31 pm    Post subject: Re: unicode html

Sybren Stuvel wrote:
Quote:
lorenzo.viscanti@gmail.com enlightened us with:

As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.

Why would you want that? Just make sure you declare your document as
UTF-8, encode it as such, and you're done. Much easier.

For example, I am programming a script that makes html pages, but I do
not have the ability to change the "Content-Type .. charset=.." line
that is sent preceding those pages.

Jim
Sybren Stuvel
*nix forums Guru


Joined: 22 Feb 2005
Posts: 550

Posted: Mon Jul 17, 2006 6:38 pm    Post subject: Re: unicode html

Jim enlightened us with:
Quote:
For example, I am programming a script that makes html pages, but I
do not have the ability to change the "Content-Type .. charset=.."
line that is sent preceding those pages.

"line"? Are you talking about the HTTP header? If it is wrong, it
should be corrected. If you are in control of the content, you should
also be in control of the Content-Type header. Otherwise, use a <meta>
tag that describes the content.

Sybren
Jim
*nix forums Guru


Joined: 20 Feb 2005
Posts: 609

Posted: Mon Jul 17, 2006 7:51 pm    Post subject: Re: unicode html

Sybren Stuvel wrote:
Quote:
Jim enlightened us with:
For example, I am programming a script that makes html pages, but I
do not have the ability to change the "Content-Type .. charset=.."
line that is sent preceding those pages.

"line"? Are you talking about the HTTP header? If it is wrong, it
should be corrected. If you are in control of the content, you should
also be in control of the Content-Type header. Otherwise, use a <meta>
tag that describes the content.

Ah, but I cannot change it. It is not my machine and the folks who own
the machine perceive that the charset line that they use is the right
one for them. (Many people ship pages off this machine.)

Unfortunately, the <meta> tag idea also does not fly: see
http://www.w3.org/TR/html4/charset.html
in section 5.2.2 where it states that in a contest the charset
parameter wins.
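
To make that precedence concrete, here is a rough Python 3 sketch. The function names are made up, and it only models the first two steps of the priority list in section 5.2.2 (HTTP "charset" parameter first, then a <meta> declaration); it is not how any particular browser is implemented:

```python
from email.message import Message

def http_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()   # lowercased, None if absent

def effective_charset(content_type, meta_charset=None):
    # The HTTP charset parameter wins; <meta> is only consulted
    # when the header carries no charset at all.
    return http_charset(content_type) or meta_charset

print(effective_charset('text/html; charset=ISO-8859-1', 'utf-8'))
print(effective_charset('text/html', 'utf-8'))
```

So with a fixed `charset=ISO-8859-1` header, a `utf-8` <meta> declaration is simply never used.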

My only point is that things are complicated and that there are times
when HTML entities are the answer (or anyway, an answer).

Jim
Sybren Stuvel
*nix forums Guru


Joined: 22 Feb 2005
Posts: 550

Posted: Mon Jul 17, 2006 8:50 pm    Post subject: Re: unicode html

Jim enlightened us with:
Quote:
Ah, but I cannot change it. It is not my machine and the folks who
own the machine perceive that the charset line that they use is the
right one for them.

Well, _you_ are the one providing the content, aren't you?

Quote:
(Many people ship pages off this machine.)

Sounds like they either don't know what they are talking about, or use
incompetent software. With Apache, it's very easy to give every
directory its own default character encoding header.

Quote:
Unfortunately, the <meta> tag idea also does not fly: see
http://www.w3.org/TR/html4/charset.html in section 5.2.2 where it
states that in a contest the charset parameter wins.

I assume that with "the charset parameter" you mean "the HTTP header",
as the <meta> tag also has a "charset parameter".

Quote:
My only point is that things are complicated

Call me thick, but from my point of view they aren't. Everybody can
have their own default character encoding, as long as the software is
configured properly.

Sybren
Jim
*nix forums Guru


Joined: 20 Feb 2005
Posts: 609

Posted: Mon Jul 17, 2006 10:02 pm    Post subject: Re: unicode html

Sybren Stuvel wrote:
Quote:
Jim enlightened us with:
Ah, but I cannot change it. It is not my machine and the folks who
own the machine perceive that the charset line that they use is the
right one for them.

Well, _you_ are the one providing the content, aren't you?

? This site has many people operating off of it (it is
sourceforge-like) and the operators (who are volunteers) are kind
enough to let us use it in the first place. I presume that they think
the charset line that they use is the one that most people want.
Probably if they changed it then someone else would complain.

Quote:
Sounds like they either don't know what they are talking about, or use
incompetent software. With Apache, it's very easy to give every
directory its own default character encoding header.

I am operating under constraints. Asking the operators of the site has
led to the understanding that I must work with the charset parameter
that I have. That is, I have an environment in which I must work, and
whether you or I think the people providing the service should do it
differently doesn't matter. I replied originally because I thought I
could give an example of HTML entities providing a way that I can solve
the problem that is entirely under my control.

Quote:
Unfortunately, the <meta> tag idea also does not fly: see
http://www.w3.org/TR/html4/charset.html in section 5.2.2 where it
states that in a contest the charset parameter wins.

I assume that with "the charset parameter" you mean "the HTTP header",
as the <meta> tag also has a "charset parameter".

AIUI "charset parameter" is the language of the HTML standard that I
referred to. For the meta tag, I at least would use "charset
attribute".

Quote:
My only point is that things are complicated

Call me thick, but from my point of view they aren't.

;-)


Jim
Damjan
*nix forums Guru Wannabe


Joined: 24 Feb 2005
Posts: 226

Posted: Mon Jul 17, 2006 10:02 pm    Post subject: Re: unicode html

Quote:
Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;

'&#%d;' % ord(u'\u0430')

or

'&#x%x;' % ord(u'\u0430')
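
Applied to a whole string, and using a named entity where one exists, that could look roughly like this in Python 3. `to_entities` is a made-up helper; the name table is the standard library's `html.entities.codepoint2name`, which covers the HTML 4 entities only:

```python
from html.entities import codepoint2name

def to_entities(text, named=True):
    """Replace every non-ASCII character with an HTML character reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                            # ASCII passes through
        elif named and cp in codepoint2name:
            out.append('&%s;' % codepoint2name[cp])   # e.g. &ocirc;
        else:
            out.append('&#%d;' % cp)                  # numeric fallback
    return ''.join(out)

print(to_entities('c\u00f4t\u00e9 \u0430'))
```

This sketch leaves `&`, `<` and `>` untouched, so markup survives; escape those separately if the input is plain text.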

Quote:
for all available html entities.


--
damjan
Stefan Behnel
*nix forums addict


Joined: 18 Apr 2005
Posts: 81

Posted: Tue Jul 18, 2006 6:43 am    Post subject: Re: unicode html

lorenzo.viscanti@gmail.com wrote:
Quote:
Hi, I've found lots of material on the net about unicode html
conversions, but still i'm having many problems converting unicode
characters to html entities. Is there any available function to solve
this issue?
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.

I don't know how you generate your HTML, but ElementTree and lxml both have
good HTML parsers, so that you can let them write out the result with a
"US-ASCII" encoding and they will generate numeric entities for everything
that's not ASCII.

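>>> from lxml import etree
>>> root = etree.HTML(my_html_data)    # my_html_data: your HTML source
>>> # encoding is passed by keyword; recent lxml releases require that
>>> html_7_bit = etree.tostring(root, encoding="us-ascii")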

Stefan
Sybren Stuvel
*nix forums Guru


Joined: 22 Feb 2005
Posts: 550

Posted: Tue Jul 18, 2006 7:37 am    Post subject: Re: unicode html

Jim enlightened us with:
Quote:
AIUI "charset parameter" is the language of the HTML standard that I
referred to. For the meta tag, I at least would use "charset
attribute".

Then I'm at a loss as to what you actually mean. You might be confusing
HTTP with HTML.

Sorry I couldn't help you further.

Sybren
Duncan Booth
*nix forums Guru


Joined: 11 Mar 2005
Posts: 422

Posted: Tue Jul 18, 2006 7:57 am    Post subject: Re: unicode html

wrote:

Quote:
As an example I would like to do this kind of conversion:
\uc3B4 => &ocirc;
for all available html entities.

>>> u"\u3cB4".encode('ascii','xmlcharrefreplace')
'&#15540;'


Don't bother using named entities. If you encode your unicode as ascii
replacing all non-ascii characters with the xml entity reference then your
pages will display fine whatever encoding is specified in the HTTP headers.
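
A quick way to convince yourself of this in Python 3 is sketched below. The list of charsets is an arbitrary sample, and the argument only holds for ASCII-compatible encodings (so not UTF-16, for instance):

```python
import html

original = 'na\u00efve \u4e2d\u6587'
wire_bytes = original.encode('ascii', 'xmlcharrefreplace')

# Pretend the server declares each of these charsets in turn.
declared_charsets = ('utf-8', 'iso-8859-1', 'cp1252', 'koi8-r')
decoded = {cs: wire_bytes.decode(cs) for cs in declared_charsets}

# Every decoding yields the same markup, and resolving the character
# references gives back the original text.
same_markup = len(set(decoded.values())) == 1
round_trips = all(html.unescape(v) == original for v in decoded.values())
print(same_markup, round_trips)
```

Here `html.unescape` stands in for what the browser does when it resolves the references.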
Sybren Stuvel
*nix forums Guru


Joined: 22 Feb 2005
Posts: 550

Posted: Tue Jul 18, 2006 3:04 pm    Post subject: Re: unicode html

Duncan Booth enlightened us with:
Quote:
Don't bother using named entities. If you encode your unicode as
ascii replacing all non-ascii characters with the xml entity
reference then your pages will display fine whatever encoding is
specified in the HTTP headers.

Which means OP can't use Unicode/UTF-8 entity references, since that's
not specified in the HTTP header.

Sybren
Duncan Booth
*nix forums Guru


Joined: 11 Mar 2005
Posts: 422

Posted: Tue Jul 18, 2006 3:21 pm    Post subject: Re: unicode html

Sybren Stuvel wrote:

Quote:
Duncan Booth enlightened us with:
Don't bother using named entities. If you encode your unicode as
ascii replacing all non-ascii characters with the xml entity
reference then your pages will display fine whatever encoding is
specified in the HTTP headers.

Which means OP can't use Unicode/UTF-8 entity references, since that's
not specified in the HTTP header.

That doesn't matter: character references are not affected by the network
encoding.

From http://www.w3.org/TR/html4/charset.html#h-5.3.1

Quote:
5.3.1 Numeric character references

Numeric character references specify the code position of a character
in the document character set.

The character references use the *document character set*, which is
independent of the character encoding used for network transmission. This
is defined for HTML as ISO10646, and (section 5.1) "The character set
defined in [ISO10646] is character-by-character equivalent to Unicode
([UNICODE])".
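
In Python 3 terms, a small sketch of the distinction: the number in a character reference is the Unicode code point, not the bytes of any particular encoding. Confusing the two is possibly where the `\uc3B4` in the original question came from, since C3 B4 is the UTF-8 byte sequence for ô, whose code point is U+00F4:

```python
import html

ch = '\u00f4'                    # ô, code point U+00F4
dec_ref = '&#%d;' % ord(ch)      # decimal reference
hex_ref = '&#x%x;' % ord(ch)     # hexadecimal reference
utf8_bytes = ch.encode('utf-8')  # transmission encoding, unrelated to the refs

# All three spellings resolve to the same character.
resolved = [html.unescape(r) for r in (dec_ref, hex_ref, '&ocirc;')]
print(dec_ref, hex_ref, utf8_bytes, resolved)
```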
Sybren Stuvel
*nix forums Guru


Joined: 22 Feb 2005
Posts: 550

Posted: Tue Jul 18, 2006 5:21 pm    Post subject: Re: unicode html

Duncan Booth enlightened us with:
Quote:
The character references use the *document character set*, which is
independent of the character encoding used for network transmission.
This is defined for HTML as ISO10646, and (section 5.1) "The
character set defined in [ISO10646] is character-by-character
equivalent to Unicode ([UNICODE])".

I didn't know that. Thanks for the lecture :)

Sybren