Talk:UTF-7

From Wikipedia, the free encyclopedia

Technical question

What exactly is encoded as Base64 between the + and the -? Is it UTF-16? -- Timwi 02:37, 21 Dec 2003 (UTC)

Yeah: "Unicode is encoded using Modified Base64 by first converting Unicode 16-bit quantities to an octet stream (with the most significant octet first). Text with an odd number of octets is ill-formed." (from the RFC) CGS 01:04, 23 Dec 2003 (UTC).
Thanks. Fixed the article. -- Timwi 04:16, 23 Dec 2003 (UTC)
Unfortunately it seems the UTF-7 standard was not updated to clarify the situation with regard to Unicode 2.0 (even though it cites version 2.0 of the Unicode standard). It treats UCS-2 as synonymous with "16-bit Unicode" and neither explicitly allows nor explicitly forbids use of UTF-16 surrogates. It would be interesting to know how practical implementations handle this. Plugwash (talk) 02:41, 19 August 2011 (UTC)
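
To illustrate the RFC quote above, here is a minimal sketch in Python of what ends up between the + and the -. The helper name shifted_block is made up for this example. Encoding with utf-16-be means a character outside the BMP would come out as a surrogate pair; that is an assumption of the sketch, not something the RFC settles.

```python
import base64


def shifted_block(text: str) -> str:
    """Encode text as a single UTF-7 shifted block (hypothetical helper)."""
    # 16-bit quantities as an octet stream, most significant octet first.
    octets = text.encode("utf-16-be")
    # Modified Base64: the usual alphabet, but without the '=' padding.
    b64 = base64.b64encode(octets).decode("ascii").rstrip("=")
    return "+" + b64 + "-"


print(shifted_block("£"))       # +AKM-
print(shifted_block("日本語"))  # +ZeVnLIqe-
```

Feeding the result back through Python's built-in "utf-7" codec should return the original text, which is a quick way to check the sketch against one practical implementation.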

Thanks

Thanks for spotting that "i" :-) -- Timwi 02:00, 15 Feb 2004 (UTC)

Deprecation

Does anyone have any references for when/why this was deprecated? From my understanding it will generally match or beat all three of the other formats practical for Unicode e-mail (UTF-8 with quoted-printable, UTF-8 with Base64, and UTF-16 with Base64). Plugwash 23:55, 17 July 2005 (UTC)

The IMC's guidelines for i18n of internet e-mail (here), published Aug 1998, say use of utf-7 in internet e-mail is strongly discouraged. The Unicode 4.0 spec mentions only utf-8, utf-16 and utf-32 (conspicuously omitting utf-7). This page mentions some drawbacks, but I don't know which (if any) of these were behind abandoning utf-7. -- Rick Block (talk) 02:36, July 18, 2005 (UTC)

Neither of those sites actually uses the word "deprecated", and the Internet Mail Consortium's site really seems to miss the point. Sure, UTF-8 CAN be handled by MIME; it's just that UTF-8 + quoted-printable is terrible (6 bytes minimum for anything non-ASCII!) and UTF-8 + Base64 isn't exactly hugely efficient either. yw
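
For anyone who wants to check the size comparison themselves, here is a rough sketch using Python's standard library. The helper name encoded_sizes and the sample strings are made up for illustration, and the counts ignore the line-wrapping overhead that real MIME bodies add. The 6-byte figure corresponds to a two-byte UTF-8 sequence with each byte written as =XX.

```python
import base64
import quopri


def encoded_sizes(text: str) -> dict:
    """Byte counts for the four options compared above (illustrative helper)."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")
    return {
        "UTF-7": len(text.encode("utf-7")),
        "UTF-8 + quoted-printable": len(quopri.encodestring(utf8)),
        "UTF-8 + Base64": len(base64.b64encode(utf8)),
        "UTF-16 + Base64": len(base64.b64encode(utf16)),
    }


# Arbitrary samples: one accented letter, mostly-ASCII text, all non-ASCII text.
for sample in ["é", "Grüße aus Köln", "日本語のテキスト"]:
    print(sample, encoded_sizes(sample))
```

Which option comes out smallest depends on the mix of ASCII and non-ASCII characters, so it is worth trying text that resembles the mail you actually care about.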

Transfer encoding syntax

A couple months ago, an anonymous contributor (83.248.26.202) added this to the intro:

Despite the name, UTF-7 is not a UTF. It is rather a transfer encoding syntax (TES), as is Punycode for internationalized domain names.

Plugwash recently removed this and asked, in an HTML comment, what the difference is between a TES and a UTF.

I don't know exactly what the difference is, but I can say that when I was cleaning up the encoding related categories, I ran across some examples of what I was tempted to call character meta-encodings that had been misfiled as character sets:

  • Encodings for 7-bit transport of 8-bit data; these were originally intended to transport encoded text, but they're actually for any binary data. Examples include Quoted-printable, Base64, Radix-64, ASCII armor, Ascii85, Uuencode, and YEnc. If you use one of these encodings in a MIME message body, you use MIME's Content Transfer Encoding mechanism to signal that you used it.
  • Encoded-word, which is an encoding scheme for representing non-ASCII text in a MIME message header value. This is basically just mapping arbitrary UCS characters to sequences of ASCII-range Unicode characters.

I haven't studied UTF-7 at all, really (yet), but if it doesn't map arbitrary UCS characters to code values or byte sequences, then it's more like the examples above and less like the other UTFs. -- mjb 00:56, 13 August 2005 (UTC)

UTF-7 does map a sequence of code points to a sequence of bytes like the other UTFs, but unlike them there is more than one valid way to represent a piece of text in UTF-7, and its output is designed to be used directly in internet mail. Essentially it was a case of recognising that UTF-8 + quoted-printable is an insanely inefficient encoding and doing better by designing a single process for the entire task.
From a registration and mail header point of view UTF-7 is considered to be a character set (e.g. it's listed at http://www.iana.org/assignments/character-sets). Plugwash 01:19, 13 August 2005 (UTC)
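
As a small illustration of the "more than one valid way" point, the sketch below decodes three different byte strings with Python's built-in "utf-7" codec. The sample text "Hi!" is arbitrary; '!' is used because it is one of the characters that may be written either directly or through Base64.

```python
# Three byte strings that a UTF-7 decoder maps to the same text.
variants = [
    b"Hi!",         # everything written directly
    b"Hi+ACE-",     # only '!' goes through Base64
    b"+AEgAaQAh-",  # the whole string goes through Base64
]

decoded = [v.decode("utf-7") for v in variants]
print(decoded)
```

All three entries decode to the same three-character string, whereas UTF-8 or UTF-16 would give exactly one byte sequence for it.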

Security

"UTF-7 allows multiple representations of the same source string ..." So what? The security section says nothing about security. IMO it should be completely deleted. bungalo (talk) 09:10, 12 October 2009 (UTC)

The problem with character encodings that allow multiple representations of the same string is that if and when someone attempts to do validation, some representations are likely to get missed. This can allow dangerous strings to be slipped past the validation. UTF-8 "quoted printable" has the same issue though, so the second part of that section was bullshit (I've fixed it up now). Ultimately it's not a problem as long as the developers of apps are aware they must decode before validating. Plugwash (talk) 22:31, 14 October 2009 (UTC)
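
A toy sketch of the "decode before validating" point, assuming Python and its built-in "utf-7" codec. The validator contains_script_tag and the payload are invented for this example; a real filter would be far more thorough.

```python
def contains_script_tag(text: str) -> bool:
    """Toy validator (hypothetical): flags text containing '<script'."""
    return "<script" in text.lower()


# '<' and '>' are written as Base64 blocks, so the raw bytes are plain ASCII.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Validating the raw bytes as if they were ASCII misses the tag ...
naive = contains_script_tag(payload.decode("ascii"))
# ... while decoding UTF-7 first exposes it.
careful = contains_script_tag(payload.decode("utf-7"))

print(naive, careful)  # False True
```

The same ordering problem would apply to any layered encoding, which is why the check belongs after the last decoding step.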

Limitation to ASCII

The best I could find in RFC 821:

        The mail data may contain any of the 128 ASCII characters.  All
        characters are to be delivered to the recipient's mailbox
        including format effectors and other control characters.  If
        the transmission channel provides an 8-bit byte (octets) data
        stream, the 7-bit ASCII codes are transmitted right justified
        in the octets with the high order bits cleared to zero

Furthermore, a couple of the appendices on transmission through various media say that "The SMTP data is 7-bit ASCII characters."

So while I can't find anything explicitly forbidding higher-valued characters, it certainly implies that support for them cannot be relied on. Therefore any sender who wants to make sure their mail gets through unmolested needs to avoid going beyond 7-bit ASCII.

If we look at RFC 2821 (the replacement for RFC 821) we see the following statement:

  The content
  is textual in nature, expressed using the US-ASCII repertoire [1].
  Although SMTP extensions (such as "8BITMIME" [20]) may relax this
  restriction for the content body, the content headers are always
  encoded using the US-ASCII repertoire.

-- Plugwash (talk) 01:10, 15 December 2009 (UTC)
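
To make the 7-bit constraint concrete, here is a minimal sketch in Python that checks whether every octet has its high-order bit cleared, as the RFC 821 passage describes. The helper name is_seven_bit and the sample text are made up for illustration.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every octet has its high-order bit cleared (hypothetical helper)."""
    return all(octet < 0x80 for octet in data)


text = "naïve café"
print(is_seven_bit(text.encode("utf-8")))  # False
print(is_seven_bit(text.encode("utf-7")))  # True
```

Raw UTF-8 fails the check as soon as the text contains a non-ASCII character, while UTF-7 output passes it without any further transfer encoding.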

So, you state that “the transmission format is US-ASCII”, although it “may [be] relax[ed]”. IMHO this is nothing but an oxymoron: if the format were defined as US-ASCII, then octets with the high bit set would not be allowed by definition. If such octets are possible (even though they are not guaranteed to be treated in an expected way), then “the transmission format” is something broader than US-ASCII, by definition of the latter. Please add citations as necessary, or remove the mention of “the transmission format” altogether; simply reverting my citation request for a controversial unreferenced statement is an edit bordering on vandalism. Incnis Mrsi (talk) 04:53, 15 December 2009 (UTC)

Not yet developed: UTF-6 and UTF-5

I propose to delete this section. Not only are "UTF-5" and "UTF-6" unimplemented vaporware (they were early entries in an IDNA competition which Punycode ultimately won), but they have nothing whatsoever to do with UTF-7. Doug Ewell 19:49, 22 May 2010 (UTC) -- Preceding unsigned comment added by DougEwell (talk • contribs)

Done. It was completely bogus anyway: these encodings are clearly related to "Punycode" and not intended for any other purpose. Besides, it's enough to have this cruft in one article, not here. –89.204.137.230 (talk) 21:59, 10 June 2011 (UTC)

Microsoft doesn't say that any more

The paragraph beginning "Confusingly, Microsoft..." refers to an erroneous piece of .NET documentation from 2011. Microsoft has corrected the page [1] so that it no longer refers to UTF-7. Since the only remaining purpose for leaving this paragraph in the article is to try to embarrass Microsoft for an earlier mistake, I suggest deleting it. Doug Ewell (talk) 17:58, 14 November 2015 (UTC)

Done, thanks for pointing that out. BabelStone (talk) 18:59, 14 November 2015 (UTC)

Web relevance

Near the top of the article is data about how few web sites use UTF-7, but I don't understand how this is relevant to the topic. Most of the rest of the discussion talks about UTF-7 in the context of e-mail, where it originated. I am minded to move the web-related sentences further down the page into their own section, unless someone suggests otherwise. Stephen lamppost (talk) 00:52, 10 March 2020 (UTC)

This is now done. Stephen lamppost (talk) 01:20, 14 March 2020 (UTC)