How To Remove Weird Encoding From Txt File

August 30, 2022

I am trying to process text files like this one: http://www.sec.gov/Archives/edgar/data/789019/000119312514289961/0001193125-14-289961.txt If you see around the middle of the file

Solution 1:

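The encoding you are looking at is uuencode. In Python 2, you would use the uu module to decode this blob, or simply stringdata.decode('uu'). In Python 3, str has no decode method, so call codecs.decode(data, 'uu') on a bytes object instead, or decode line by line with binascii.a2b_uu. Note that the uu module itself was deprecated in Python 3.11 and removed in 3.13.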
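As a rough illustration of decoding on Python 3, here is a minimal sketch that relies only on binascii.a2b_uu from the standard library. The helper name decode_uu_block is hypothetical, and the sketch assumes the input holds a single well-formed block delimited by a "begin MODE NAME" line and an "end" line.

```python
import binascii


def decode_uu_block(lines):
    """Decode the first uuencoded block found in an iterable of text lines.

    Hypothetical helper: assumes one well-formed block that starts with
    a 'begin MODE NAME' line and finishes with a line reading 'end'.
    """
    out = bytearray()
    in_body = False
    for line in lines:
        # Strip only line terminators; trailing spaces can be significant.
        line = line.rstrip("\r\n")
        if not in_body:
            if line.startswith("begin "):
                in_body = True
            continue
        if line == "end":
            break
        if not line:
            # Skip blank lines rather than passing them to a2b_uu.
            continue
        out += binascii.a2b_uu(line)
    return bytes(out)
```

Calling decode_uu_block(open(path, encoding="ascii", errors="replace")) would return the raw bytes of the first embedded file, which you could then write out in binary mode.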

uuencode is a legacy format which was originally used to embed binaries in email (which then only permitted 7-bit US-ASCII; the format also has some concessions for interoperability with big-iron systems of the day which used their own bewildering character encodings). These days, you would expect to see base64 in this role.

I posted an answer to the followup question which shows how to remove uuencode blobs while reading from a filehandle or iterating over a bunch of lines of text.
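The followup answer itself is not reproduced here, but the general idea can be sketched as a generator that drops everything from a "begin MODE NAME" line through the matching "end" line. The names strip_uu_blobs and BEGIN_RE are hypothetical, and the pattern for the begin line is an assumption about what the blobs in these files look like.

```python
import re

# Assumed shape of a uuencode header: 'begin', an octal mode, a file name.
BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S")


def strip_uu_blobs(lines):
    """Yield the lines of a text stream with uuencoded blobs removed.

    Hypothetical helper: everything from a 'begin MODE NAME' line up to
    and including the next 'end' line is skipped. If a blob is never
    terminated, the rest of the input is dropped.
    """
    in_blob = False
    for line in lines:
        stripped = line.rstrip("\r\n")
        if in_blob:
            if stripped == "end":
                in_blob = False
            continue
        if BEGIN_RE.match(stripped):
            in_blob = True
            continue
        yield line
```

Because it accepts any iterable of lines, the same function works on an open filehandle or on a list of strings, for example "".join(strip_uu_blobs(fh)).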

Solution 2:

The problem can be solved efficiently with the sed command provided here: sed command - apply in all text (.txt) files of folder
