Greptastic. • 1 February 2012 • The SnowBlog


Flushed with success from the overwhelming response to my previous geeky post, I am thrilled to bring you something else that you'll never use, but which would save you masses of time if you did.

Do you ever have an XML file which isn't validating because of invisible non-ASCII characters? Yeah? You do? I bet you do! Allow me to help!

Open the Terminal application.

Change into the directory which contains your buggy file:

    cd path/to/buggy/file

Run the following command:

    grep -n -P "[\x80-\xFF]" file.xml

That syntax, broken down:

grep - the command that does the searching.

-n - gives the line number of the problem in the output.

-P - tells the computer that there is a Perl regular expression coming up. Perl is a programming language. That's about all I know about it. Doesn't stop me from using it. (This option belongs to GNU grep. If your grep complains about -P, it isn't the GNU one.)

"[\x80-\xFF]" - the pattern you want the computer to look for. This will find every character whose code is in the range 0x80 to 0xFF.*

file.xml - the name of the file you want to search.

Terminal will run through your file and report every line containing a character that matches the pattern, so you can gleefully zap them.

My primary source was Stack Overflow. You should spend time there. It's nerdalicious.

----------------

* Eh? you say. Characters don't look like that. They look like this: "a", or "d", or, rather exotically, "z". Yes they do, but computers are funny. Computers like lists, and order, and to know exactly what's what. A computer likes to use a system known as ASCII (American Standard Code for Information Interchange) just to be on the safe side, so that if it's handed a character, it can be sure it knows which character you mean. If you use a character on the ASCII list, you know that your computer - and pretty much any computer software - is going to be able to handle it. The ASCII list gives each of the most-used characters its own number, from 0 to 127. The letter 'a', for instance, is number 97, which is written "61" in hexadecimal code.
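If you'd like to see those codes for yourself, here is a small Python sketch. It isn't part of the grep recipe, and the function name describe is made up for this illustration. It returns a character's code as a plain number, the same code in hexadecimal, and whether that code is on the ASCII list (0 to 127).

```python
def describe(ch):
    """Return (decimal code, hexadecimal code, is it ASCII?) for one character.

    Hypothetical helper for illustration only.
    """
    code = ord(ch)                 # the number the computer uses for this character
    return code, hex(code), code <= 0x7F   # 0x7F is 127, the last ASCII code


for ch in ["a", "d", "z"]:
    print(ch, describe(ch))
# a (97, '0x61', True)
# d (100, '0x64', True)
# z (122, '0x7a', True)
```

Anything that comes back with False in the last slot is off the ASCII list.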
0x80 (one of the codes we use in the pattern, here) is a computer-y way of saying the number you probably know as 128, or CXXVIII if you're still into Roman numerals, or 80 if you've been looking at the hexadecimal color picker in Photoshop for too long. 0xFF is 255, or CCLV, or FF. ASCII stops at 127, so anything from 128 upwards is non-ASCII, and those pesky characters which don't display on your screen and which bugger up your XML are all in that range.
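If your grep won't play along with -P, the same hunt can be sketched in Python. This is a rough equivalent rather than a replacement for the command above: it reads the file as raw bytes, splits it on newlines, and reports the number of every line containing a byte in the range 0x80 to 0xFF. The function names find_non_ascii_lines and report are invented for this example.

```python
import re

# Any single byte from 0x80 to 0xFF: the same range as the grep pattern.
NON_ASCII = re.compile(rb"[\x80-\xFF]")


def find_non_ascii_lines(data):
    """Return (line_number, line) pairs for lines holding a byte >= 0x80.

    `data` is the raw contents of the file, as bytes.
    """
    hits = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if NON_ASCII.search(line):
            hits.append((number, line))
    return hits


def report(path):
    """Print the offending lines of the file at `path`, grep -n style."""
    with open(path, "rb") as handle:
        for number, line in find_non_ascii_lines(handle.read()):
            print(f"{number}:{line!r}")
```

Calling report("file.xml") prints each suspect line with its line number in front. The line is shown in Python's bytes notation, so the invisible culprits show up as escapes such as \xc2\xa0.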


The SnowBlog is one of the oldest publishing blogs, started in 2003, and it's been through various content management systems over the years. A 2005 techno-blunder meant we lost the early years, but the archives you're reading now go all the way back to 2005.

Many of the older posts in our blog archive suffer from link rot. Apologies if you see missing links and images: let us know if you'd like us to find any in particular.
