csvflatten¶
csvflatten is a command for producing “flattened” records. Useful for quickly getting a view of records with numerous fields, and for documenting data examples in Markdown-compatible format.
For example, given a data.csv
containing this:
id,product,price,description
001,apples,1.50,"An apple is an edible fruit produced by an apple tree (Malus domestica)"
002,oranges,2.25,"An orange is a type of citrus fruit that people often eat. Oranges are a very good source of vitamin C."
— csvflatten can be used to transform the data into a “narrow” 2-column denormalized format:
$ csvflatten data.csv
field,value
id,001
product,apples
price,1.50
description,An apple is an edible fruit produced by an apple tree (Malus domestica)
~~~~~~~~~~~,
id,002
product,oranges
price,2.25
description,An orange is a type of citrus fruit that people often eat. Oranges are a very good source of vitamin C.
TK jump to prettify section
Table of contents
Usage reference¶
-P, --prettify
¶
Print output in Markdown tabular format instead of CSV
TK
A common use case is to produce “pretty” output — e.g. Markdown-table format — which can be easier to skim than a spreadsheet, especially for very wide data:
$ csvflatten -P data.csv
| field | value |
| ----------- | ----------------------------------------------------- |
| id | 001 |
| product | apples |
| price | 1.50 |
| description | An apple is an edible fruit produced by an apple tree |
| | (Malus domestica) |
| ~~~~~~~~~~~ | |
| id | 002 |
| product | oranges |
| price | 2.25 |
| description | An orange is a type of citrus fruit that people often |
| | eat. Oranges are a very good source of vitamin C. |
-L, --max-length <max_length_of_field>
¶
TK Split up values longer than [max_field_length] into multiple row-values as needed.
-R, --rec-id
¶
TK Include a _recid_ column for each row, for easier tracking the 0-based index of each record
-B, --label-chunks
¶
When a long value is split into multiple “chunks”, the field (i.e. first column) is left blank after the first chunk.
Setting the --chunk-labels flag will fill the field column with: “field~n”, where n indicates the n-th chunk of a chopped value
-E, --eor <end_of_record_divider>
¶
TK end of record; When flattening multiple records, separate each records with a row w/ fieldname of [marker]. Set to ‘’ or ‘none’ to disable.
By default, the EOR marker is a series of tildes (~~~~~).
Note: this setting defaults to ‘none’ if -R/--rowid flag is used
High level overview¶
For every input record, csvflatten’s output will contain 2-column rows — field,value
— for each of the record’s key-value pairs. This is useful for viewing records one at a time, especially if each row contains many columns.
It’s a concept similar to xsv flatten, though the structure of “flattened” output differs.
For example, given the following table:
id | product | price |
---|---|---|
001 | apples | 1.50 |
002 | oranges | 2.25 |
The “flattened” view of its 2 records would look like this:
field | value |
---|---|
id | 001 |
product | apples |
price | 1.50 |
id | 002 |
product | oranges |
price | 2.25 |
Basic example¶
Given a data file, ids.csv:
TK given hamlet.csv: hey
By default, csvflatten produces CSV output without word-wrapping long fields (such as the lines
field in the example below). But the most common use-case is to produce pretty tabular output, including word-wrapping long fields to the width of the terminal. This is done using the -P/--prettify
flag:
$ csvflatten examples/hamlet.csv -P
| fieldname | value |
| --------- | ---------------------------------------------- |
| act | 1 |
| scene | 5 |
| speaker | Horatio |
| lines | Propose the oath, my lord. |
| ~~~~~~~~~ | |
| act | 1 |
| scene | 5 |
| speaker | Hamlet |
| lines | Never to speak of this that you have seen, |
| | Swear by my sword. |
| ~~~~~~~~~ | |
| act | 1 |
| scene | 5 |
| speaker | Ghost |
| lines | [Beneath] Swear. |
| ~~~~~~~~~ | |
| act | 3 |
| scene | 4 |
| speaker | Gertrude |
| lines | O, speak to me no more; |
| | These words, like daggers, enter in mine ears; |
| | No more, sweet Hamlet! |
| ~~~~~~~~~ | |
| act | 4 |
| scene | 7 |
| speaker | Laertes |
| lines | Know you the hand? |
This output is suitable for pasting into a Markdown file to produce a formatted HTML table:
TK mention how hamlet.csv has new lines
act,scene,speaker,lines
1,5,Horatio,"Propose the oath, my lord."
1,5,Hamlet,"Never to speak of this that you have seen,
Swear by my sword."
1,5,Ghost,[Beneath] Swear.
3,4,Gertrude,"O, speak to me no more;
These words, like daggers, enter in mine ears;
No more, sweet Hamlet!"
4,7,Laertes,Know you the hand?
How it compares to existing tools¶
Compared to csvkit’s``csvlook``¶
csvlook doesn’t pretty-format multi-line fields, and can also result in very wide tables without --max-column-width
:
$ csvlook examples/hamlet.csv --max-column-width 50
| act | scene | speaker | lines |
| --- | ----- | -------- | -------------------------------------------------- |
| 1 | 5 | Horatio | Propose the oath, my lord. |
| 1 | 5 | Hamlet | Never to speak of this that you have seen,
Swea... |
| 1 | 5 | Ghost | [Beneath] Swear. |
| 3 | 4 | Gertrude | O, speak to me no more;
These words, like dagge... |
| 4 | 7 | Laertes | Know you the hand? |
Compared to xsv flatten
¶
xsv flatten does do auto-wrapping of long entries, but doesn’t produce tableized output:
$ xsv flatten examples/hamlet.csv
act 1
scene 5
speaker Horatio
lines Propose the oath, my lord.
#
act 1
scene 5
speaker Hamlet
lines Never to speak of this that you have seen,
Swear by my sword.
#
act 1
scene 5
speaker Ghost
lines [Beneath] Swear.
#
act 3
scene 4
speaker Gertrude
lines O, speak to me no more;
These words, like daggers, enter in mine ears;
No more, sweet Hamlet!
#
act 4
scene 7
speaker Laertes
lines Know you the hand?
Compared to tabulate
¶
python-tabulate is a command-line tool for producing a variety of tabular outputs, including rst
, grid
, and html
formats. However, it does not handle multi-line fields well. Nor does it natively handle the CSV format, e.g. double-quoted values that contain commas, hence, the use of csvkit’s csvformat to change delimiters to \t
in the example below:
$ csvformat -T examples/hamlet.csv | tabulate -f grid -1 -s '\t'
+------------------------------------------------+---------+-----------+---------------------------------------------+
| act | scene | speaker | lines |
+================================================+=========+===========+=============================================+
| 1 | 5 | Horatio | Propose the oath, my lord. |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 1 | 5 | Hamlet | "Never to speak of this that you have seen, |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| Swear by my sword." | | | |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 1 | 5 | Ghost | [Beneath] Swear. |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 3 | 4 | Gertrude | "O, speak to me no more; |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| These words, like daggers, enter in mine ears; | | | |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| No more, sweet Hamlet!" | | | |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 4 | 7 | Laertes | Know you the hand? |
+------------------------------------------------+---------+-----------+---------------------------------------------+
That said, if you like tabulate
’s table-formatting options, such as -f grid
, you can pipe csvflatten (and csvformat to convert to tab-delimiters) into tabulate
like so:
$ csvflatten --eor 'none' examples/hamlet.csv | csvformat -T \
tabulate -f grid -1 -s '\t'
+---------+------------------------------------------------+
| field | value |
+=========+================================================+
| act | 1 |
+---------+------------------------------------------------+
| scene | 5 |
+---------+------------------------------------------------+
| speaker | Horatio |
+---------+------------------------------------------------+
| lines | Propose the oath, my lord. |
+---------+------------------------------------------------+
| act | 1 |
+---------+------------------------------------------------+
| scene | 5 |
+---------+------------------------------------------------+
| speaker | Hamlet |
+---------+------------------------------------------------+
| lines | Never to speak of this that you have seen, |
+---------+------------------------------------------------+
| | Swear by my sword. |
+---------+------------------------------------------------+
| act | 1 |
+---------+------------------------------------------------+
| scene | 5 |
+---------+------------------------------------------------+
| speaker | Ghost |
+---------+------------------------------------------------+
| lines | [Beneath] Swear. |
+---------+------------------------------------------------+
| act | 3 |
+---------+------------------------------------------------+
| scene | 4 |
+---------+------------------------------------------------+
| speaker | Gertrude |
+---------+------------------------------------------------+
| lines | O, speak to me no more; |
+---------+------------------------------------------------+
| | These words, like daggers, enter in mine ears; |
+---------+------------------------------------------------+
| | No more, sweet Hamlet! |
+---------+------------------------------------------------+
| act | 4 |
+---------+------------------------------------------------+
| scene | 7 |
+---------+------------------------------------------------+
| speaker | Laertes |
+---------+------------------------------------------------+
| lines | Know you the hand? |
+---------+------------------------------------------------+
Reference: Options and usage¶
-P/--prettify¶
Print output in tabular format instead of CSV. Unless -L/--max-length
is explicitly specified, long values are split into multiple rows based on the current terminal width.
-L/--max-length [integer]¶
Specify a max character length for field values; values that exceed this length are split into multiple rows. This is useful for producing output easier to view in a spreadsheet:
$ csvflatten -L 5 examples/abc123.csv
field,value
code,alfa
blob,01234
,56789
~~~~~,
code,beta
blob,ABCDE
,FGHIJ
Or combining with the -P/--prettify
option:
$ csvflatten -P -L 5 examples/abc123.csv
| field | value |
| ----- | ----- |
| code | alfa |
| blob | 01234 |
| | 56789 |
| ~~~~~ | |
| code | beta |
| blob | ABCDE |
| | FGHIJ |
The default behavior is as follows:
- Without
--prettify
mode, this value is set to0
, i.e. no splitting of long values. - With
--prettify
mode, this value is automatically set to the width of the terminal. To disable this behavior, you can explicitly set--max-length 0
-B/--chunk-labels¶
When a value is chopped into chunks across multiple rows, by default, the field
(i.e. first column) is filled in for the value’s first row, then left blank for its subsequent rows:
| field | value |
+-------+---------+
| id | 001 |
| title | this is |
| | a story |
| | of love |
If the --chunk-labels
flag is set, each subsequent field
will be filled with an incremental label, e.g.:
| field | value |
+----------+---------+
| id | 001 |
| title | this is |
| title__1 | a story |
| title__2 | of love |
-E/--eor [END_OF_RECORD_MARKER (string)]¶
By default, each record is separated by having a string of tildes in field
, e.g.:
| field | value |
+-------+---------+
| id | 001 |
| title | this is |
| | a story |
| | of love |
| ~~~~~ | |
| id | 002 |
| title | Book 2 |
Set 'none'
to disable:
| field | value |
+-------+---------+
| id | 001 |
| title | this is |
| | a story |
| | of love |
| id | 002 |
| title | Book 2 |
Or to a value of your choosing:
$ csvflatten -E 'NEW-RECORD' data.csv
| field | value |
+------------+---------+
| id | 001 |
| title | this is |
| | a story |
| | of love |
| NEW-RECORD | |
| id | 002 |
| title | Book 2 |
-R/--rec-id¶
Include a _recid_
column that contains the 0-based index for the respective record:
| _recid_ | field | value |
+---------+-------+---------+
| 0 | id | 001 |
| 0 | title | this is |
| 0 | | a story |
| 0 | | of love |
| 1 | id | 002 |
| 1 | title | Book 2 |
Note that -rec-id
by default disables the end-of-record separator
Common scenarios and use cases¶
Making multiline tweets easier to read¶
about the data TK¶
TK Raw data comes from Twitter’s API and using twarc’s flattened CSV version:
$ twarc timeline POTUS --format csv
Reference file: tweets-newlines.csv
TK: image of data in excel
$ csvflatten -P examples/tweets-newlines.csv
First two tweets, in flattened format:
.. code-block:: text
field | value |---------------------------- | ------------------------------------------------------ |id | 1196869405629702148 |tweet_url | https://twitter.com/POTUS/status/1196869405629702148 |created_at | Tue Nov 19 19:14:37 +0000 2019 |parsed_created_at | 2019-11-19 19:14:37+00:00 |user_screen_name | POTUS |text | Since the day President @realDonaldTrump took office, || House Democrats have ignored working Americans to || focus on their only real agenda item: impeachment. || Meanwhile, our country created $11 trillion in worth. |tweet_type | retweet |coordinates | |hashtags | |media | |urls | |favorite_count | 40704 |in_reply_to_screen_name | |in_reply_to_status_id | |in_reply_to_user_id | |lang | en |place | |possibly_sensitive | |retweet_count | 12547 |retweet_or_quote_id | 1196864730226401280 |retweet_or_quote_screen_name | WhiteHouse |retweet_or_quote_user_id | 822215673812119553 |source | <a href=”https://www.sprinklr.com” || rel=”nofollow”>TheWhiteHouse</a> |user_id | 822215679726100480 |user_created_at | Thu Jan 19 22:54:28 +0000 2017 |user_default_profile_image | False |user_description | 45th President of the United States of America, || @realDonaldTrump. Tweets archived: |user_favourites_count | 104 |user_followers_count | 32630758 |user_friends_count | 39 |user_listed_count | 29517 |user_location | Washington, D.C. |user_name | President Trump |user_statuses_count | 10928 |user_time_zone | |user_urls | http://WhiteHouse.gov |user_verified | True |~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |id | 1240016437248438276 |tweet_url | https://twitter.com/POTUS/status/1240016437248438276 |created_at | Tue Mar 17 20:45:31 +0000 2020 |parsed_created_at | 2020-03-17 20:45:31+00:00 |user_screen_name | POTUS |text | It’s time for all Americans to do their part to stop || the transmission of the Coronavirus. We will come || out stronger than ever before! https://t.co/u0SHHpRJ0l |tweet_type | retweet |coordinates | |hashtags | |media | |urls | |favorite_count | 15363 |in_reply_to_screen_name | |in_reply_to_status_id | |in_reply_to_user_id | |lang | en |place | |possibly_sensitive | |retweet_count | 3779 |retweet_or_quote_id | 1240016322920091655 |retweet_or_quote_screen_name | WhiteHouse |retweet_or_quote_user_id | 822215673812119553 |source | <a href=”https://www.sprinklr.com” || rel=”nofollow”>TheWhiteHouse</a> |user_id | 822215679726100480 |user_created_at | Thu Jan 19 22:54:28 +0000 2017 |user_default_profile_image | False |user_description | 45th President of the United States of America, || @realDonaldTrump. Tweets archived: |user_favourites_count | 104 |user_followers_count | 32630764 |user_friends_count | 39 |user_listed_count | 29517 |user_location | Washington, D.C. |user_name | President Trump |user_statuses_count | 10928 |user_time_zone | |user_urls | http://WhiteHouse.gov |user_verified | True |