csvflatten

csvflatten is a command for producing “flattened” records. Useful for quickly getting a view of records with numerous fields, and for documenting data examples in Markdown-compatible format.

For example, given a data.csv containing this:

id,product,price,description
001,apples,1.50,"An apple is an edible fruit produced by an apple tree (Malus domestica)"
002,oranges,2.25,"An orange is a type of citrus fruit that people often eat. Oranges are a very good source of vitamin C."

csvflatten can be used to transform the data into a “narrow” 2-column denormalized format:

$ csvflatten data.csv
field,value
id,001
product,apples
price,1.50
description,An apple is an edible fruit produced by an apple tree (Malus domestica)
~~~~~~~~~~~,
id,002
product,oranges
price,2.25
description,An orange is a type of citrus fruit that people often eat. Oranges are a very good source of vitamin C.

TK jump to prettify section

Usage reference

-P, --prettify

Print output in Markdown tabular format instead of CSV

TK

A common use case is to produce “pretty” output — e.g. Markdown-table format — which can be easier to skim than a spreadsheet, especially for very wide data:

$ csvflatten -P data.csv
| field       | value                                                 |
| ----------- | ----------------------------------------------------- |
| id          | 001                                                   |
| product     | apples                                                |
| price       | 1.50                                                  |
| description | An apple is an edible fruit produced by an apple tree |
|             | (Malus domestica)                                     |
| ~~~~~~~~~~~ |                                                       |
| id          | 002                                                   |
| product     | oranges                                               |
| price       | 2.25                                                  |
| description | An orange is a type of citrus fruit that people often |
|             | eat. Oranges are a very good source of vitamin C.     |

-L, --max-length <max_length_of_field>

TK Split up values longer than [max_field_length] into multiple row-values as needed.

-R, --rec-id

TK Include a _recid_ column for each row, for easier tracking the 0-based index of each record

-B, --label-chunks

When a long value is split into multiple “chunks”, the field (i.e. first column) is left blank after the first chunk.

Setting the --chunk-labels flag will fill the field column with: “field~n”, where n indicates the n-th chunk of a chopped value

-E, --eor <end_of_record_divider>

TK end of record; When flattening multiple records, separate each records with a row w/ fieldname of [marker]. Set to ‘’ or ‘none’ to disable.

By default, the EOR marker is a series of tildes (~~~~~).

Note: this setting defaults to ‘none’ if -R/--rowid flag is used

High level overview

For every input record, csvflatten’s output will contain 2-column rows — field,value — for each of the record’s key-value pairs. This is useful for viewing records one at a time, especially if each row contains many columns.

It’s a concept similar to xsv flatten, though the structure of “flattened” output differs.

For example, given the following table:

id product price
001 apples 1.50
002 oranges 2.25

The “flattened” view of its 2 records would look like this:

field value
id 001
product apples
price 1.50
   
id 002
product oranges
price 2.25

Basic example

Given a data file, ids.csv:

TK given hamlet.csv: hey

By default, csvflatten produces CSV output without word-wrapping long fields (such as the lines field in the example below). But the most common use-case is to produce pretty tabular output, including word-wrapping long fields to the width of the terminal. This is done using the -P/--prettify flag:

$ csvflatten examples/hamlet.csv -P

| fieldname | value                                          |
| --------- | ---------------------------------------------- |
| act       | 1                                              |
| scene     | 5                                              |
| speaker   | Horatio                                        |
| lines     | Propose the oath, my lord.                     |
| ~~~~~~~~~ |                                                |
| act       | 1                                              |
| scene     | 5                                              |
| speaker   | Hamlet                                         |
| lines     | Never to speak of this that you have seen,     |
|           | Swear by my sword.                             |
| ~~~~~~~~~ |                                                |
| act       | 1                                              |
| scene     | 5                                              |
| speaker   | Ghost                                          |
| lines     | [Beneath] Swear.                               |
| ~~~~~~~~~ |                                                |
| act       | 3                                              |
| scene     | 4                                              |
| speaker   | Gertrude                                       |
| lines     | O, speak to me no more;                        |
|           | These words, like daggers, enter in mine ears; |
|           | No more, sweet Hamlet!                         |
| ~~~~~~~~~ |                                                |
| act       | 4                                              |
| scene     | 7                                              |
| speaker   | Laertes                                        |
| lines     | Know you the hand?                             |

This output is suitable for pasting into a Markdown file to produce a formatted HTML table:

hamlet-markdown-table.png

TK mention how hamlet.csv has new lines

act,scene,speaker,lines
1,5,Horatio,"Propose the oath, my lord."
1,5,Hamlet,"Never to speak of this that you have seen,
Swear by my sword."
1,5,Ghost,[Beneath] Swear.
3,4,Gertrude,"O, speak to me no more;
These words, like daggers, enter in mine ears;
No more, sweet Hamlet!"
4,7,Laertes,Know you the hand?

How it compares to existing tools

Compared to csvkit’s``csvlook``

csvlook doesn’t pretty-format multi-line fields, and can also result in very wide tables without --max-column-width:

$ csvlook examples/hamlet.csv --max-column-width 50

| act | scene | speaker  | lines                                              |
| --- | ----- | -------- | -------------------------------------------------- |
|   1 |     5 | Horatio  | Propose the oath, my lord.                         |
|   1 |     5 | Hamlet   | Never to speak of this that you have seen,
Swea... |
|   1 |     5 | Ghost    | [Beneath] Swear.                                   |
|   3 |     4 | Gertrude | O, speak to me no more;
These words, like dagge... |
|   4 |     7 | Laertes  | Know you the hand?                                 |

Compared to xsv flatten

xsv flatten does do auto-wrapping of long entries, but doesn’t produce tableized output:

$ xsv flatten examples/hamlet.csv

act      1
scene    5
speaker  Horatio
lines    Propose the oath, my lord.
#
act      1
scene    5
speaker  Hamlet
lines    Never to speak of this that you have seen,
Swear by my sword.
#
act      1
scene    5
speaker  Ghost
lines    [Beneath] Swear.
#
act      3
scene    4
speaker  Gertrude
lines    O, speak to me no more;
These words, like daggers, enter in mine ears;
No more, sweet Hamlet!
#
act      4
scene    7
speaker  Laertes
lines    Know you the hand?

Compared to tabulate

python-tabulate is a command-line tool for producing a variety of tabular outputs, including rst, grid, and html formats. However, it does not handle multi-line fields well. Nor does it natively handle the CSV format, e.g. double-quoted values that contain commas, hence, the use of csvkit’s csvformat to change delimiters to \t in the example below:

$ csvformat -T examples/hamlet.csv | tabulate -f grid -1 -s '\t'

+------------------------------------------------+---------+-----------+---------------------------------------------+
| act                                            |   scene | speaker   | lines                                       |
+================================================+=========+===========+=============================================+
| 1                                              |       5 | Horatio   | Propose the oath, my lord.                  |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 1                                              |       5 | Hamlet    | "Never to speak of this that you have seen, |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| Swear by my sword."                            |         |           |                                             |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 1                                              |       5 | Ghost     | [Beneath] Swear.                            |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 3                                              |       4 | Gertrude  | "O, speak to me no more;                    |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| These words, like daggers, enter in mine ears; |         |           |                                             |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| No more, sweet Hamlet!"                        |         |           |                                             |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 4                                              |       7 | Laertes   | Know you the hand?                          |
+------------------------------------------------+---------+-----------+---------------------------------------------+

That said, if you like tabulate’s table-formatting options, such as -f grid, you can pipe csvflatten (and csvformat to convert to tab-delimiters) into tabulate like so:

$ csvflatten --eor 'none' examples/hamlet.csv | csvformat -T \
    tabulate -f grid -1 -s '\t'

+---------+------------------------------------------------+
| field   | value                                          |
+=========+================================================+
| act     | 1                                              |
+---------+------------------------------------------------+
| scene   | 5                                              |
+---------+------------------------------------------------+
| speaker | Horatio                                        |
+---------+------------------------------------------------+
| lines   | Propose the oath, my lord.                     |
+---------+------------------------------------------------+
| act     | 1                                              |
+---------+------------------------------------------------+
| scene   | 5                                              |
+---------+------------------------------------------------+
| speaker | Hamlet                                         |
+---------+------------------------------------------------+
| lines   | Never to speak of this that you have seen,     |
+---------+------------------------------------------------+
|         | Swear by my sword.                             |
+---------+------------------------------------------------+
| act     | 1                                              |
+---------+------------------------------------------------+
| scene   | 5                                              |
+---------+------------------------------------------------+
| speaker | Ghost                                          |
+---------+------------------------------------------------+
| lines   | [Beneath] Swear.                               |
+---------+------------------------------------------------+
| act     | 3                                              |
+---------+------------------------------------------------+
| scene   | 4                                              |
+---------+------------------------------------------------+
| speaker | Gertrude                                       |
+---------+------------------------------------------------+
| lines   | O, speak to me no more;                        |
+---------+------------------------------------------------+
|         | These words, like daggers, enter in mine ears; |
+---------+------------------------------------------------+
|         | No more, sweet Hamlet!                         |
+---------+------------------------------------------------+
| act     | 4                                              |
+---------+------------------------------------------------+
| scene   | 7                                              |
+---------+------------------------------------------------+
| speaker | Laertes                                        |
+---------+------------------------------------------------+
| lines   | Know you the hand?                             |
+---------+------------------------------------------------+

Reference: Options and usage

-P/--prettify

Print output in tabular format instead of CSV. Unless -L/--max-length is explicitly specified, long values are split into multiple rows based on the current terminal width.

-L/--max-length [integer]

Specify a max character length for field values; values that exceed this length are split into multiple rows. This is useful for producing output easier to view in a spreadsheet:

$ csvflatten -L 5 examples/abc123.csv

field,value
code,alfa
blob,01234
,56789
~~~~~,
code,beta
blob,ABCDE
,FGHIJ

Or combining with the -P/--prettify option:

$ csvflatten -P -L 5 examples/abc123.csv

| field | value |
| ----- | ----- |
| code  | alfa  |
| blob  | 01234 |
|       | 56789 |
| ~~~~~ |       |
| code  | beta  |
| blob  | ABCDE |
|       | FGHIJ |

The default behavior is as follows:

  • Without --prettify mode, this value is set to 0, i.e. no splitting of long values.
  • With --prettify mode, this value is automatically set to the width of the terminal. To disable this behavior, you can explicitly set --max-length 0

-B/--chunk-labels

When a value is chopped into chunks across multiple rows, by default, the field (i.e. first column) is filled in for the value’s first row, then left blank for its subsequent rows:

| field |  value  |
+-------+---------+
| id    | 001     |
| title | this is |
|       | a story |
|       | of love |

If the --chunk-labels flag is set, each subsequent field will be filled with an incremental label, e.g.:

|  field   |  value  |
+----------+---------+
| id       | 001     |
| title    | this is |
| title__1 | a story |
| title__2 | of love |

-E/--eor [END_OF_RECORD_MARKER (string)]

By default, each record is separated by having a string of tildes in field, e.g.:

| field |  value  |
+-------+---------+
| id    | 001     |
| title | this is |
|       | a story |
|       | of love |
| ~~~~~ |         |
| id    | 002     |
| title | Book 2  |

Set 'none' to disable:

| field |  value  |
+-------+---------+
| id    | 001     |
| title | this is |
|       | a story |
|       | of love |
| id    | 002     |
| title | Book 2  |

Or to a value of your choosing:

$ csvflatten -E 'NEW-RECORD' data.csv

|   field    |  value  |
+------------+---------+
| id         | 001     |
| title      | this is |
|            | a story |
|            | of love |
| NEW-RECORD |         |
| id         | 002     |
| title      | Book 2  |

-R/--rec-id

Include a _recid_ column that contains the 0-based index for the respective record:

| _recid_ | field |  value  |
+---------+-------+---------+
|       0 | id    | 001     |
|       0 | title | this is |
|       0 |       | a story |
|       0 |       | of love |
|       1 | id    | 002     |
|       1 | title | Book 2  |

Note that -rec-id by default disables the end-of-record separator

Common scenarios and use cases

Making multiline tweets easier to read

about the data TK

TK Raw data comes from Twitter’s API and using twarc’s flattened CSV version:

$ twarc timeline POTUS --format csv

Reference file: tweets-newlines.csv

TK: image of data in excel

$ csvflatten -P examples/tweets-newlines.csv

First two tweets, in flattened format:

.. code-block:: text
field | value |
---------------------------- | ------------------------------------------------------ |
id | 1196869405629702148 |
created_at | Tue Nov 19 19:14:37 +0000 2019 |
parsed_created_at | 2019-11-19 19:14:37+00:00 |
user_screen_name | POTUS |
text | Since the day President @realDonaldTrump took office, |
| House Democrats have ignored working Americans to |
| focus on their only real agenda item: impeachment. |
| Meanwhile, our country created $11 trillion in worth. |
tweet_type | retweet |
coordinates | |
hashtags | |
media | |
urls | |
favorite_count | 40704 |
in_reply_to_screen_name | |
in_reply_to_status_id | |
in_reply_to_user_id | |
lang | en |
place | |
possibly_sensitive | |
retweet_count | 12547 |
retweet_or_quote_id | 1196864730226401280 |
retweet_or_quote_screen_name | WhiteHouse |
retweet_or_quote_user_id | 822215673812119553 |
source | <a href=”https://www.sprinklr.com” |
| rel=”nofollow”>TheWhiteHouse</a> |
user_id | 822215679726100480 |
user_created_at | Thu Jan 19 22:54:28 +0000 2017 |
user_default_profile_image | False |
user_description | 45th President of the United States of America, |
| @realDonaldTrump. Tweets archived: |
user_favourites_count | 104 |
user_followers_count | 32630758 |
user_friends_count | 39 |
user_listed_count | 29517 |
user_location | Washington, D.C. |
user_name | President Trump |
user_statuses_count | 10928 |
user_time_zone | |
user_urls | http://WhiteHouse.gov |
user_verified | True |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
id | 1240016437248438276 |
created_at | Tue Mar 17 20:45:31 +0000 2020 |
parsed_created_at | 2020-03-17 20:45:31+00:00 |
user_screen_name | POTUS |
text | It’s time for all Americans to do their part to stop |
| the transmission of the Coronavirus. We will come |
| out stronger than ever before! https://t.co/u0SHHpRJ0l |
tweet_type | retweet |
coordinates | |
hashtags | |
media | |
urls | |
favorite_count | 15363 |
in_reply_to_screen_name | |
in_reply_to_status_id | |
in_reply_to_user_id | |
lang | en |
place | |
possibly_sensitive | |
retweet_count | 3779 |
retweet_or_quote_id | 1240016322920091655 |
retweet_or_quote_screen_name | WhiteHouse |
retweet_or_quote_user_id | 822215673812119553 |
source | <a href=”https://www.sprinklr.com” |
| rel=”nofollow”>TheWhiteHouse</a> |
user_id | 822215679726100480 |
user_created_at | Thu Jan 19 22:54:28 +0000 2017 |
user_default_profile_image | False |
user_description | 45th President of the United States of America, |
| @realDonaldTrump. Tweets archived: |
user_favourites_count | 104 |
user_followers_count | 32630764 |
user_friends_count | 39 |
user_listed_count | 29517 |
user_location | Washington, D.C. |
user_name | President Trump |
user_statuses_count | 10928 |
user_time_zone | |
user_urls | http://WhiteHouse.gov |
user_verified | True |