TSV is a widely used format for storing tabular data, but it can be confusing when working with textual data and the Pandas library. Two factors cause the confusion:
- TSV is similar to CSV (a well-known format for storing data), but it is not the same.
- The default Pandas settings are not compatible with the TSV format.
In this story, I briefly discuss the source of the confusion and present the best way to handle the TSV format using the Pandas library.
TSV [1] is a simple file format similar to CSV. However, there are several important differences:
- It uses tabs to separate the fields.
- It does not allow some characters inside fields, namely line feeds (\n), tabs (\t) and carriage returns (\r).
- There is no quoting of fields and no escaping of special characters [2] (at least in the original format).
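The rules above can be illustrated with a minimal sketch that uses Python's standard csv module rather than Pandas. The sample data and the names raw, read_tsv, rows_tsv and rows_csv_style are made up for this illustration. The sketch assumes the input already obeys the TSV rules, so it only shows the effect of turning quote handling off.

```python
import csv
import io

# Hypothetical sample: two columns, one field containing literal double quotes.
raw = 'id\ttext\n1\t"hello"\n2\tplain text\n'


def read_tsv(text):
    """Parse TSV text into a list of rows, treating quotes as ordinary characters."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",          # fields are separated by tabs
        quoting=csv.QUOTE_NONE,  # TSV has no quoting, so quotes stay in the data
    )
    return list(reader)


# TSV-style parsing keeps the double quotes as part of the field.
rows_tsv = read_tsv(raw)

# For comparison: CSV-style quote handling (the default) strips the quotes.
rows_csv_style = list(csv.reader(io.StringIO(raw, newline=""), delimiter="\t"))

print(rows_tsv[1])        # the quotes are preserved
print(rows_csv_style[1])  # the quotes are removed
```

With CSV-style quote handling the field is silently changed, which is the kind of mismatch that arises when a TSV file is read with settings meant for CSV.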
Point 2 is problematic when dealing with text fields, as they may contain the forbidden characters. The suggested way to deal with the forbidden characters is to replace them with arbitrary text, like…