
How To Remove Weird Encoding From Txt File

I am trying to process text files like this one: http://www.sec.gov/Archives/edgar/data/789019/000119312514289961/0001193125-14-289961.txt If you see around the middle of the file

Solution 1:

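The encoding you are looking at is uuencode. In Python 2, you would use the uu module to decode this blob, or simply stringdata.decode('uu'). In Python 3, str objects have no decode method, and the uu module was deprecated in 3.11 and removed in 3.13; binascii.a2b_uu, which decodes one uuencoded line at a time, is still available.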
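As a rough illustration of decoding such a blob in Python 3, here is a minimal sketch built on binascii.a2b_uu. The function name uudecode and its return shape are assumptions made for this example, not part of any library, and the sketch assumes the blob follows the usual "begin <mode> <name>" ... "end" layout.

```python
import binascii


def uudecode(lines):
    """Decode the first uuencoded blob found in an iterable of text lines.

    Hypothetical helper: returns (filename, decoded_bytes).
    """
    it = iter(lines)
    filename = None
    # Skip everything up to the "begin <mode> <name>" header.
    for line in it:
        if line.startswith("begin "):
            parts = line.rstrip("\r\n").split(" ", 2)
            filename = parts[2] if len(parts) == 3 else ""
            break
    else:
        raise ValueError("no 'begin' line found")

    chunks = []
    for line in it:
        stripped = line.rstrip("\r\n")
        if stripped == "end":
            break
        if not stripped:
            continue  # ignore blank lines inside the blob
        # Each body line starts with a length character followed by data.
        chunks.append(binascii.a2b_uu(stripped.encode("ascii")))
    return filename, b"".join(chunks)
```

Calling uudecode(open(path, encoding="ascii", errors="replace")) on a file like the one in the question would return the name and bytes of the first embedded attachment only. A file with several blobs would need the function called repeatedly on the same iterator.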

uuencode is a legacy format which was originally used to embed binaries in email (which then only permitted 7-bit US-ASCII; the format also has some concessions for interoperability with big-iron systems of the day which used their own bewildering character encodings). These days, you would expect to see base64 in this role.

I posted an answer to the followup question which shows how to remove uuencode blobs while reading from a filehandle or iterating over a bunch of lines of text.
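The referenced answer is not reproduced here. As a hedged sketch of the general idea, a generator can drop every line from a "begin <mode> <name>" header through the matching "end" line and pass everything else through. The names BEGIN_RE and strip_uuencoded are made up for this example, and the header pattern is an assumption about how the blobs are delimited.

```python
import re

# Assumed header shape: "begin", an octal mode, then a file name.
BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S")


def strip_uuencoded(lines):
    """Yield the input lines, omitting any uuencoded blobs."""
    in_blob = False
    for line in lines:
        if in_blob:
            # Drop body lines; the blob stops at a line that is just "end".
            if line.rstrip("\r\n") == "end":
                in_blob = False
            continue
        if BEGIN_RE.match(line):
            in_blob = True
            continue
        yield line
```

Because it is a generator, it works equally on an open file handle or a list of lines, for example: for line in strip_uuencoded(open(path)): .... Note that a blob missing its "end" line would cause the rest of the input to be discarded.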


Solution 2:

The problem can be solved efficiently with sed, as described in the question "sed command - apply in all text (.txt) files of folder".

