Wednesday, June 14, 2006

this api can break

Over at WinCustomize, someone thought they'd found an Easter egg in the Windows Notepad application. If you:
  1. Open Notepad
  2. Type the text "this app can break" (without quotes)
  3. Save the file
  4. Re-open the file in Notepad
Notepad displays seemingly-random Chinese characters, or boxes if your default Notepad font doesn't support those characters.

It's not an Easter egg (even though it seems like a funny one), and as it turns out, Notepad writes the file correctly. It's only when Notepad reads the file back in that it seems to lose its mind.

But we can't even blame Notepad: it's a limitation of Windows itself, specifically the Windows function that Notepad uses to figure out if a text file is Unicode or not.

You see, text files containing Unicode (more correctly, UTF-16-encoded Unicode) are supposed to start with a "Byte-Order Mark" (BOM), which is a two-byte flag that tells a reader how the following UTF-16 data is encoded (that is, which byte order it uses). Given that these two bytes are exceedingly unlikely to occur at the beginning of an ASCII text file, the BOM is commonly used to tell whether a text file is encoded in UTF-16.
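To make the BOM check concrete, here is a minimal Python sketch. It is an illustration only, not Notepad's actual code, and the function name `sniff_utf16_bom` is made up for this example. It looks at nothing but the first two bytes, so it has no opinion about files that lack a BOM:

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return 'utf-16-le' or 'utf-16-be' if data starts with a UTF-16 BOM, else None.

    Hypothetical helper for illustration. It ignores the fact that a
    UTF-32LE BOM also begins with the bytes FF FE.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "utf-16-be"
    return None


# Plain ASCII has no BOM, so the check cannot say anything about it.
print(sniff_utf16_bom(b"this app can break"))
# A little-endian UTF-16 file that does carry the marker.
print(sniff_utf16_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
```

When the function returns None, the reader is left guessing, which is exactly where the trouble starts.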

But plenty of applications don't bother writing this marker at the beginning of a UTF-16-encoded file. So what's an app like Notepad to do?

Windows helpfully provides a function called IsTextUnicode()--you pass it some data, and it tells you whether it's UTF-16-encoded or not.

Sorta.

It actually runs a couple of heuristics over the first 256 bytes of the data and provides its best guess. As it turns out, these tests aren't terribly reliable for very short ASCII strings that contain an even number of lower-case letters, like "this app can break", or more appropriately, "this api can break".
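You can reproduce the garbling without Notepad. Here is a rough Python sketch, assuming the misdetected file is decoded as little-endian UTF-16 (the helper name `misread_as_utf16le` is hypothetical). Each pair of ASCII bytes gets fused into a single 16-bit code unit, and with a lower-case letter in the high byte, that code unit lands among the CJK ideographs:

```python
def misread_as_utf16le(text: str) -> str:
    """Encode text as ASCII bytes, then reinterpret those bytes as UTF-16LE."""
    raw = text.encode("ascii")      # what gets written for an ANSI save of ASCII text
    return raw.decode("utf-16-le")  # what a reader sees if it guesses UTF-16


for s in ("this app can break", "Bush hid the facts"):
    garbled = misread_as_utf16le(s)
    print(f"{s!r}: {len(s)} bytes -> {len(garbled)} characters: {garbled}")
    print("  code units:", " ".join(f"U+{ord(c):04X}" for c in garbled))
```

Note that in both strings the spaces fall in the low byte of each pair, so every resulting code unit has a letter in its high byte.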

The documentation for IsTextUnicode says:

These tests are not foolproof. The statistical tests assume certain amounts of variation between low and high bytes in a string, and some ASCII strings can slip through. For example, if lpBuffer points to the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, though failure would be preferable.

Indeed.

As a wise man once said, "In the face of ambiguity, refuse the temptation to guess."

75 comments:

Anonymous said...

So this is really an MS error ;p, stupid m$

Anonymous said...

i stabbed Microsoft in the face when i read this.

Anonymous said...

I tested this in Notepad, and yes; Notepad is broken.
Then I tested it with my text editor of choice, EDXOR, and the text read fine.

Anonymous said...

Notepad is full of bugs.

Anonymous said...

Anonymous said...

So this is really an MS error ;p, stupid m$



'M$' wrote notepad, so it was either way.

Anonymous said...

"this windows will break"

Anonymous said...

it only happens the first time you save and reopen. change the characters, save, close, open, change back to 'this app can break', save, close.. open, and Notepad somehow reads it normally again. what the fuck

Anonymous said...

This is probably the funniest thing I've ever read :)

Anonymous said...

people that spell microsoft using a $ sign are grade a retard$.

Anonymous said...

By the time you open it again to change the characters, instead of "Save" (Ctrl-S), go to File->Save As. The encoding has already been set to "Unicode", hence what you describe. Besides, check the length of the file, it's 38 bytes not 18 anymore.

>>>> it only happens the first time you save and reopen. change the characters, save, close, open, change back to 'this app can break', save, close.. open, and Notepad somehow reads it normally again. what the fuck
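The 38-versus-18 figure is easy to check. Here is a quick Python sketch, assuming Notepad's "Unicode" option means little-endian UTF-16 with a two-byte BOM in front (that mapping is an assumption of this example, not something stated above):

```python
import codecs

text = "this app can break"

# ANSI save: one byte per character for plain ASCII text.
ansi_bytes = text.encode("ascii")

# Assumed "Unicode" save: 2-byte BOM, then two bytes per character.
unicode_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(len(ansi_bytes))     # 18
print(len(unicode_bytes))  # 38 = 2 + 18 * 2
```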

Anonymous said...

Maybe I missed the point of the post but... it isn't an error and it isn't MS's fault (or Notepad's, for that matter). Notepad is looking out for broken 3rd-party apps, and the test for these broken 3rd-party apps is not foolproof, as documented... you just can't write something better if you are going to take into account the millions of ignorant coders out there like MS does. Better luck next time M$ H4ters!!!1

John H. said...

The 'bug' only occurs when you save it as ASCII/ANSI instead of Unicode; if you specify Unicode when saving, the result doesn't differ from the proper behaviour. One is also able to select the encoding type when opening a file, so when Windows detects a file as Unicode but it really isn't, one is able to force the encoding to use when opening it.

Anonymous said...

It's actually my fault

Anonymous said...

This even works in Vista Public Release Beta 2 (5483). Seems like they are doing the same thing, or haven't changed it in Vista.

Bharath said...

Naa, my notepad works fine!
Tested it many times, no problem here!

Anonymous said...

Well it works fine with windows notepad using wine on linux :-)

Anonymous said...

I can't use notepad for anything any more. When you save, it doesn't just save the text from the window -- it saves that, then rebuilds the text widget. If you have word-wrap enabled, it rewraps everything, but doesn't update the display. So then if you position the cursor or select something, and type new text, you've just mutilated your document. (Select everything and unselect it to see what text is really there.)

Yay Windows...

Anonymous said...

Those are pretty insightful comments that this post attracted here. Not that I would pretend to have more wit or relevance...

This problem is impossible to solve in the case of text files. Microsoft prefers backward compatibility to consistency. Had they decided to turn every string, on file or in memory, into 16-bit Unicode, no exceptions, this would not happen.

I know, this would break the 657 trillion legacy DOS and Windows 3.11 apps out there (wow, Windows 3.11, anyone remember how great this OS and those applications used to run?), but at least Notepad would open text files just fine (provided it's not one of those obsolete 8-bit ASCII-encoded text files).

At least, when a shortcoming, or flaw, or bug, is documented, it becomes a feature. Microsoft would stamp that one in red ink: "This behavior is by design"

Anonymous said...

Shows how silly utf-16 is too :)
At least they could have specified an endianness to use.

Please just use UTF-8 guys.

Anonymous said...

Only Microsoft could have bugs in an app as basic as Notepad....

Anonymous said...

Notepad would be best if restricted to plain old text in ASCII. Perhaps M$FT needs to ship something like TextPad now that Notepad is clearly unreliable.

Anonymous said...

16-bit character encoding is a waste for English and non-Asian languages. Actually, in ASCII, only the first 4 bits are standard control chars, numbers, letters and symbols. The other 4 bits are just extra symbols (extended ASCII).

So in my opinion, as a programmer, 8 bit is standard text at 50% waste, and 16 bit is for multi lingual compatibility at 75% waste.

Anonymous said...

Ok, I tried this and used ANSI encoding and hit 'save as'.

I got a pop up window that said the text had changed and did I still want to save it. I said yes.

The file changed to 'Oscar' and when I opened it, it read 'Oscar was here'.

How odd !

Anonymous said...

Try copying the Chinese characters and pasting them to Google's translation engine at http://www.google.com/language_tools and you get a Chinese-to-English translation of "After indignant grey harassment personal sounded Fun" lol.

Anonymous said...

It's kind of funny how you hear people complaining about this very minor and meaningless "issue." "Oh my God! It's a M$ issue!!! They suck!" Please tell me when you really plan on saving a text file with that exact "xxxx xxx xxx xxxxx" combination? Would it hurt to add an extra space or something if you were? GG, stop hating on Microsoft.

Anonymous said...

>> Actually, in ASCII, only the first 4 bits are standard control chars, numbers, letters and symbols. The other 4 bits are just extra symbols (extended ASCII).

Ow. My head exploded. Dear God, the ignorance...

Anonymous said...

Bill is still richer than you. Keep focusing on these things:)

Anonymous said...

I tried this out, and it worked provided I chose ANSI and not Unicode. Pretty funny bug though, would have thought something like that would be fixed given how old Notepad is. Well, let's hope it's fixed for Vista, eh? :)

Anonymous said...

Heh heh heh.

"It's kind of funny how you hear people complaining about this very minor and meaningless "issue." "Oh my God! It's a M$ issue!!! They suck!""

Yeah. It's not like it's the operating system's job to keep from trashing your data.

"Please tell me when you really plan on saving a text file with that exact "xxxx xxx xxx xxxxx" combination?"

Amen! Who uses notepad to store short text strings? And what are the odds they'd have an even number of characters? I mean really.

"Would it hurt to add an extra space or something if you were? GG, stop hating on Microsoft."

Exactly. The first thing I do when I open a text file and am greeted by cryptic Chinese characters is re-open the file with a different encoding, and then add a space and re-save, so Windows doesn't screw it up again next time. ...doesn't everybody?

Anonymous said...

Try putting "Bush hid the facts" without the quotes. More hànzì mania!

Anonymous said...

About the guy who said that 4 bits of a byte are extended ASCII:

Only one bit is...
And you need the other seven for holding everything you can type on a keyboard and some other things (not four, since then you would only be able to store 16 combinations)..

Anonymous said...

Try this:

Open Notepad in Windows
Type "Bush hid the facts" without the quotes.
Click the close box.
Save the file as test.txt
Open the file in Notepad.
Copy the boxes into http://babelfish.altavista.com/tr
Choose "Chinese-simp to English"

See what Bush did!

Anonymous said...

I don't get it, I tried to do what the author said... But when I opened Notepad again, the text was there, and readable.

Anonymous said...

This also works with
"I'll use the linux"
without the quotes.

Anonymous said...

It's because you hit return after typing in the text, just like the retarded winblow$ user that you are.

Anonymous said...

The same error occurs in 'Notepad2'


If you don't know what it is...Google it.

Anonymous said...

after u reopen it copy paste the text in word and u get 畂桳栠摩琠敨映捡獴

Anonymous said...

It doesn't work with some combination of numbers like "1234 123 123 12345", but works on "1111 111 111 11111", funny !:)

Anonymous said...

>> Actually, in ASCII, only the first 4 bits are standard control chars, numbers, letters and symbols. The other 4 bits are just extra symbols (extended ASCII).

omg, man - you need _at least_ 26 symbols, not 16

Anonymous said...

Well, I tried it with a few different strings and got 畢桳栠摩琠敨映捡獴 (bush hid the facts), 潤杵搠杩氠牥映敲歡 (this app can break), and 楰獳挠湡猠瑥映物獥 (piss can set fires).

- said...

Another thing about Notepad, which I learned from the Help file (Yes, Notepad Help file actually taught me something!)

Create a new file, and as the first line, type:

.LOG

Save and open the file (as a .txt is fine). Notepad appends the current time and date (like hitting F5 in Notepad) every time you open the file.

I've actually used this: for example, if you're working on a project, every time you do something, just open the file, type whatever, and save it. It saves with a timestamp.

Hey, there's more to Notepad than you think! :)

And hey, why doesn't Notepad write those first two UTF-16 bytes?

Tim Lesher said...

And hey, why doesn't Notepad write those first two UTF-16 bytes?

It does, if you save the file as Unicode (in the "Encoding" field of the Save As dialog).

Anonymous said...

Actually, in ASCII, only the first 4 bits are standard control chars, numbers, letters and symbols. The other 4 bits are just extra symbols (extended ASCII).

Yes, I posted this. And yes, it is true, for the couple of binary-impaired posters:

4 bits can store a value from 0 to 127 and 8 bits can store a value from 0 to 255.

The 8-bit ASCII has all special control characters up until 32, which represents a space. (We'll avoid hex to prevent further confusion.) From there, there are a few common symbols, then numbers, then a few more symbols, followed by capital letters in order, a couple more symbols, the lower-case letters, and a couple more symbols. That takes care of the first 4 bits, 0 to 127.

Nothing above 127 (the first 4 bits) can be produced by a standard English-language keyboard. (Anything special uses two characters to represent things such as the arrow keys.)

For those of you unfamiliar with ASCII, you can see a chart on www.lookuptables.com

Anonymous said...

"Yes, I posted this. And yes, it is true, for the couple of binary-impaired posters:

4 bits can store a value from 0 to 127 and 8 bits can store a value from 0 to 255."


I think you should be careful who you call binary impaired. How exactly do you represent 127 in 4 bits?

Tim Lesher said...

4 bits can store a value from 0 to 127 and 8 bits can store a value from 0 to 255.

Yikes. I hope you haven't done any life-critical programming.

4 bits can store a value from 0 to 15 (1 + 2 + 4 + 8).
7 bits can store a value from 0 to 127 (1 + 2 + 4 + 8 + 16 + 32 + 64).
8 bits can store a value from 0 to 255.
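For anyone who wants to check the arithmetic, here is a tiny Python sketch (the helper name `max_unsigned` is just for illustration). An n-bit unsigned field holds 2^n distinct values, so its largest value is 2^n - 1:

```python
def max_unsigned(bits: int) -> int:
    """Largest value an unsigned field of the given width can hold."""
    return (1 << bits) - 1


for n in (4, 7, 8):
    print(f"{n} bits: 0 to {max_unsigned(n)} ({1 << n} distinct values)")
```

Seven bits cover 0 to 127, which is the range of standard ASCII.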

Anonymous said...

Hey man ..this will happen for any this kind of string..... "aaaa aaa aaa aaaaa"

Anonymous said...

"this app can break" is your example, but this went viral with "bush hid the facts". In a Google search there are almost ten times more hits about "bush hid the facts" than "this app can break"

You are missing the really weird strange conspiracy aspect of this bug. When you force the bug to happen, save a notepad file using the random letters and letters of the alphabet, in the form zzzz zzz zzzz zzzzz, four letters, space, three letters, space, three letters, space, five letters. You close it and then reopen it. Nine zeros appear or Chinese symbols. Copy those zeros or symbols and paste them into a translator. Here’s the translator used:

http://perso.orange.fr/gaoling/hanzi/index.htm

The zeros appear in the translator as Chinese characters. Translate those characters and up pops a bunch of words, write the words down and look at them.

z - Air; sky; empty; air force; in vain, air defense, low altitude, anti-aircraft defense.

y - A sacrifice at the beginning of a military campaign, cult. Manifest.

x - Rocky hill.

w - Eye; look, see; wide open eyes; to gaze in astonishment, blind; unperceptive, shortsighted, straight, erect, vertical, close eyes, sleep; hibernate, hypnotism; mesmerism, people, subjects, citizens.

v - Fatigued, head sores, chronic disease, chronic illness; sorrow. Legs.

u - Field.

t - Twin gems.

s - Commit crime, violate; criminal.

q - Fire, flame, burn, anger, rage, ashes, dust, lime, to roast, spirit, soul, spiritual world. Sacrificial animal.

p - Small river, gray ash, dust, engineer, physicist, head covered in dust.

That's a pretty amazing coincidence to 911.

I hope you don't delete this, it's too good a conspiracy theory. And please don't get mad at me, I'm just pointing it out.

Here's the whole story:

http://www.dreamslaughter.com/parasentient/parasentient.htm

Anonymous said...

C++ programmer said...

Actually, in ASCII, only the first 4 bits are standard control chars, numbers, letters and symbols. The other 4 bits are just extra symbols (extended ASCII).

Yes, I posted this. And yes, it is true, for the couple of binary-impaired posters:

4 bits can store a value from 0 to 127 and 8 bits can store a value from 0 to 255.


I wonder if C++ programmer wrote the character set recognition function used by notepad.

Anonymous said...

vi!

Anonymous said...

And how do some Linux programs know which encoding a txt file uses?

If the system is set to use UTF-16, then wouldn't it open a txt file that was written on an ISO-8859-15 system as if it had been written on a UTF-16 system?
So wouldn't this same thing happen?

...or something. :E

Anonymous said...

Okay, so I tested that.

Made a txt file which contained this " 湡" on a UTF-8 system.
Then sent it to an ISO-8859-15 system and opened it.
What it read was this:" 湡"

"Lol, look at that, stupid linux fails at that. But you can't really expect anything from a FREE O/S."

Anonymous said...

Stop talking bull. It is an intentional easter egg.

Anonymous said...

when i typed it in
i then put it in google for translation and it came out with:

After ? indignant grey harassment ? personal sounded Fun
^_________________________________^
got me confused, and i think there may be something going on.

Anonymous said...

so what does : "After ? indignant grey harassment ? personal sounded Fun". mean then anyway?

Anonymous said...

Brandon Turner said ...

but ... it isnt an error and it isnt MS's fault

It isn't an error??? Notepad is corrupting the original plain ascii information. Of course, it's an error, and of course it is MS's fault.

The fix is to guess the encoding when the file is selected in the input browser but allow the user to fix it. Guessing and not giving the user an option to correct the guess is stupid. Compare the Save As behaviour in IE. Better.

Anonymous said...


i don't get it...

hhhh hhh hhh hhhhh

also works?

Anonymous said...

hehe, "gods are not smart" turned into: "the worried thoughts knock the ditch"

Anonymous said...

z0mg |\/|$ r teh nub LOLOLOLOL

Now come on, seriously... Being silly isn't the answer.

Anonymous said...

This seems pretty long but still works
“Einstein's thought regarding mathematics motivated Dhilung Kirat thinkin mathematics wonderfully amazing languagez”

http://dhilung.blogspot.com/2006/08/technotepad-facts-behind-bush-hid.html

Anonymous said...

But how is Wordpad able to open it correctly?

Anonymous said...

>>Amen! Who uses notepad to store short text strings? And what are the odds they'd have an even number of characters? I mean really.

I'm no mathematician, but I'm pretty sure it's close to 1 in 2.
---

>>>I don't get it, i tried to to what the author said... But when i opened Notepad again, the text was there, and readable.

the reply
>>>It's because you hit return after typing in the text, just like the retarded winblow$ user that you are.


As stated by the famous anonymous above...

>>>>By the time you open it again to change the characters, instead of "Save" (Ctrl-S), go to File->Save As. The encoding has already been set to "Unicode", hence what you describe. Besides, check the length of the file, it's 38 bytes not 18 anymore.

>>>>>>>> it only happens the first time you save and reopen. change the characters, save, close, open, change back to 'this app can break', save, close.. open, and Notepad somehow reads it normally again. what the fuck


To the retarded linux/apple user:
Don't go all m$/micro$oft/Windoze or whatever unorignal insult your *nix self can't think up to people who use an OS that *you* obviously use as well. Most people get by just fine using windows to surf the web, read email, download porn, etc. I can hardly think someone qualifies as a retard because they use a certain OS that you claim you don't (because I sure as hell don't go to websites and make fun of bugs in other OS's that I don't even have....
That... would just be plain stupid.)
---

>>omg, man - you need _at least_ 26 symbols, not 16

heh 26. Maybe if you only like to type in one case of the alphabet, and never use the enter key, tab, or symbols like !@#$... or the 100+ other oddball characters you never usually type.
---

>>>sinsanity2006 said...
You are missing the really weird strange conspiracy aspect of this bug. When you force the bug to happen, save a notepad file using the random letters and letters of the alphabet, in the form zzzz zzz zzzz zzzzz, four letters, space, three letters, space, three letters, space, five letters. You close it and then reopen it. Nine zeros appear or Chinese symbols.... [blah blah blah]

That's a pretty amazing coincidence to 911.

I hope you don't delete this, it's too good a conspiracy theory.


Notepad uses the function IsTextUnicode(), first came with Windows NT 3.5
An OS made in 1994. Now I know you may have faith on Microsoft's ability to control the future of everything, but I disagree in their ability to foresee who even the president would be back on September 21, 1994.

Oh and those zeros you mention. They're called rectangles.
---

>>>>Of course, it's an error, and of course it is MS's fault.

>>>The fix is to guess the encoding when the file is selected in the input browser but allow the user to fix it. Guessing and not giving the user an option to correct the guess is stupid. Compare the Save As behaviour in IE. Better.



Or just use the save as feature in notepad as stated in an above post.
---

>>>>But how is Wordpad able to open it correctly?

Change the extension to .txt or open any wordpad file with notepad and you'll see all of the extra junk that tells it how it should be read.
Notepad is just meant to be a simple basic editor.
Even a newly created wordpad file that has nothing in it(or just about any program file that's not a simple editor like notepad), when opened with notepad will show you all that junk you see.
---
Disclaimer: Keep in mind I don't know everything about computers and some or all of the above statements made by me may be more than likely completely wrong... :P


...except for the ones about the arrogant linux guy.

Anonymous said...

Translation by Google Language Tools:

畂桳栠摩琠敨映捡獴 (Bush hid the facts)

=

畂 桳 Hour Morocco by showing video games seized mongoose

Anonymous said...

from the article:
"You see, text files containing Unicode (more correctly, UTF-16-encoded Unicode) are supposed to start with a "Byte-Order Mark" (BOM), which is a two-byte flag that tells a reader how the following UTF-16 data is encoded. Given that these two bytes are exceedingly unlikely to occur at the beginning of an ASCII text file, it's commonly used to tell whether a text file is encoded in UTF-16.

But plenty of applications don't bother writing this marker at the beginning of a UTF-16-encoded file."


from a comment [by Brandon Turner]:
Maybe I missed the point of the post but... it isn't an error and it isn't MS's fault (or Notepad's, for that matter). Notepad is looking out for broken 3rd-party apps, and the test for these broken 3rd-party apps is not foolproof, as documented... you just can't write something better if you are going to take into account the millions of ignorant coders out there like MS does. Better luck next time M$ H4ters!!!1

[other uninformed people's comments ...]


Since no one seems to have said this clearly, I'll state here:
1) UTF-16 character encoding schemes DO NOT compel the use of any Byte Order Mark;
2) the character encoding schemes for UTF-16 are 3 (three): UTF-16 (sic), UTF-16BE and UTF-16LE;
3) the Byte Order Mark MAY (not MUST) be used only in the first encoding scheme (UTF-16);
4) in the most used Operating Systems, for File Systems 'use', plain text files are ALWAYS written WITHOUT any character encoding scheme specification;
5) WITHOUT character encoding scheme specification, any UTF-16 (character encoding scheme) plain text file WITHOUT Byte Order Mark (which is not mandatory, I repeat) IS EQUAL TO the same file either UTF-16BE-encoded or UTF-16LE-encoded (if someone asked: UTF-16BE and UTF-16LE don't need at all a Byte Order Mark since their names carry this specification; but Operating Systems don't save this specification anywhere in File Systems, so ...);
6) all this can be used to correctly infer that, speaking of common Operating Systems with their common File Systems with plain text files:
a) there is no such distinction as "UTF-16 or UTF-16BE or UTF-16LE", since the distinction is purely based upon these names, which aren't saved anywhere [and so on ...], and
b) you can simply speak of UTF-16-encoded files WITH or WITHOUT Byte Order Mark;
7) then, good plain text file editors MUST check every file for UTF-16 (and only-god-knows-how-many-more) recognition schemes: if they find the Byte Order Mark they are pretty sure of the 'UTF-16-ness' of the file and of the related byte order, if not ..., they must invent something (pop-up with most probable charsets to select from isn't a bad idea at all, you very smart reader ;);
8) notepad IS NOT a good plain text file editor: it can't even let you choose more than 2 (two!!!) charsets, an anonymous Unicode (there are A LOT of Unicode encoding schemes!!!) choice and an equally anonymous ANSI choice; it is good solely for really basic files (and not too basic, we happen to know ;P). It is only a shame that Chinese is preferred over any Latin-alphabet written language by a Windows function, by default, in a system without CJK support: bad (actually VEEERY bad) implementation, no more ... ^_^'''

Thompson Lee said...

First of all, this notepad has some weird codes in it.

When I did this, I come up with different phrases in Chinese.

Read and learn:

This api will break:

Nothing


This app will break:

桴獩愠楰挠湡戠敲歡


aaaa aaa aaa aaaaa:

慡慡愠慡愠慡愠慡慡


The first four letters contains 00000101:hex 16

The next three letters contains:
00000100:hex 14

The next three letters contains:
00001101:hex 9

The last four letters contains:
00001111:hex 1

Add the hex numbers and get 40.

Divide it by 4 (there are four words) and get 10.

The bytes shown are 101, 100 1101 1111. So, add them by their digits:

1+0+1+1+0+0+1+1+0+1+1+1+1+1= 10.


Why 10? Continue on:

If unicode text is 10, then the program will guess the word is a symbol. IsTextUnicode() might have it in another way.

Right now, I'm asking a professional, so I still don't have the answers YET.

Anonymous said...

Translated text i got when i typed it.. Mount Albert 桳 Hour by showing video games seized mongoose

Anonymous said...

I got:
After the shooting ash torsion Hui Yu Huan renowned knockout
from "this app can break".
Holy!!!!

Anonymous said...

Surely clearing up the whole CR LF debacle is more important than sorting out this prob?
I still can't open text files in notepad. They look like sh1t

Unknown said...

final nail in the coffin of C++ programmer's argument - "4 bits can store a value from 0 to 127". Here is a complete run down of the possible values stored in 4 bits.

0000
1000
0100
1100
0010
1010
0110
1110
0001
1001
0101
1101
0011
1011
0111
1111

I count 16. case closed.

Anonymous said...

I decided to try something out with the Chinese characters.

Actually, just download a free copy of Babylon 7 and insert the entire Chinese character code.

桴獩愠灰挠湡戠敲歡

The translation to English is "this app can break".

Therefore one can conclude that the language code in the ASCII header has been saved for CHINESE and not for English which is creating this type of error. Sounds pretty easy to fix... for a Notepad programmer.

So why all the fuss? Just fix it and post a free copy of both the source code and modified application for download on the net and then embarrass M$ if you want that way.

Anonymous said...

This also works with
"Билл Гейтс самый умный"
without the quotes.

diecast cars said...

hi! thanks for letting me comment! really enjoyed reading this blog too, was of great help, using the tips I realized they are the best

Hampers said...

Just tried this and it didn't work - then noticed the post was 4 years old!! That will teach me to read the blog post date before reading the article in future.

Anonymous said...

hmm it still works for bush hid... my characters translate to "microsoft and communism do not agree".