csvslice¶
csvslice is a command for selecting rows by 0-based index and/or inclusive ranges.
Given a data file, ids.csv:
$ csvslice -i 0,2-3 ids.csv
The output:
source
id,val
a,0
c,2
d,3
Table of contents
Options and flags¶
csvslice has 3 unique options, --indexes, --head, and --tail
. Each of these specify a mode, and one and only one mode can be specified.
-i, --indexes <values>¶
<values> is a comma-delimited list of values representing individual indexes or ranges of indexes to be sliced and included from the 0-indexed dataset.
An index value can take these forms:
- An individual row:
-i 0
returns the very first row - A range of rows:
-i 2-4
returns rows in the inclusive index range of 2 through 4 - An open-ended range:
-i 3-
returns all rows starting from row index 3
Multiple interval values can be passed into -i/--indexes
, e.g.
csvslice -i '0-5,12,15-18,42-' data.csv
--head <int>¶
Return the header and the first <int> rows; <int> must be greater than 0.
--tail <int>¶
Return the header and the last <int> rows; <int> must be greater than 0.
Usage overview and examples¶
These examples refer to data as found in ids.csv
Get the first n rows with --head¶
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolor
Get the last n rows with --tail¶
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Slicing individual rows with --index¶
You can specify rows to be sliced by 0-based index:
csvslice -i 1 ids.csv
id,val
1,b
You can also specify a series of individual indexes as a comma-delimited string:
csvslice -i 0,5 ids.csv
id,val
0,a
5,f
Slicing rows by an index range¶
Rows can be specified by using a range syntax: start-end
The range is inclusive:
$ csvslice -i 1-3 ids.csv
id,val
1,b
2,c
3,d
Omitting the right-side end value returns an open range of values:
$ csvslice -i 3- ids.csv
id,val
3,d
4,e
5,f
Like indexes, a series of ranges can be specified as a comma-delimited string:
$ csvslice -i 0-1,3- ids.csv
id,val
0,a
1,b
3,d
4,e
5,f
And you can combine ranges with individual indexes:
$ csvslice -i 0,2-3,5 ids.csv
id,val
0,a
2,c
3,d
5,f
Troubleshooting¶
Even though 3-1
and is technically a valid range, csvslice will throw an error if the end
value is smaller than the start
value:
$ csvslice -i 3-1 examples/ids.csv
InvalidRange: Invalid range specified: 3-1
For the most part, though, csvslice isn’t too whiny (in terms of warning you) about messy or otherwise nonsensical values for -i/--indexes
.
No matter what order you specify the indexes and ranges, it will always return rows in sequential order:
$ csvslice -i 4,0,2 ids.csv
id,val
0,a
2,c
4,e
$ csvslice -i 4,0-2,3 ids.csv
id,val
0,a
1,b
2,c
3,d
4,e
If you pass in repeated indexes and/or overlapping ranges, csvslice will still only return the original, sequential data, i.e. it will not return duplicates of rows:
$ csvslice -i 3,1,3,1,1 ids.csv
id,val
1,b
3,d
$ csvslice -i 1,0-2,1-3 ids.csv
id,val
0,a
1,b
2,c
3,d
If you pass in references to non-existent row indexes, such as out-of-bounds numbers — those too are ignored:
csvslice -i 5,42 ids.csv
id,val
5,f
How csvslice compares to existing tools¶
Given ids.csv
(TK: not sure I need to repeat this reference?)
head: Get the first n rows¶
The issue with head is that it only understands text and newline characters. Thus, specifying -n 3
returns 3 lines, one of which is the header line.
$ head -n 3 ids.csv
id,val
0,a
1,b
In contrast, csvslice has a notion of comma-delimited data, in which the first line is not data, but the header. Thus, an argument equivalent to -n 3
would return 4 lines in total: 1 header line and 3 data lines.
$ csvslice --head 3 ids.csv
id,val
0,a
1,b
2,c
tail: Get the last n rows¶
TK lorem with tail
$ tail -n 2 ids.csv
4,e
5,f
TK lorem with csvslice
$ csvslice --tail 2 ids.csv
id,val
4,e
5,f
csvformat: Skip the first n lines¶
$ TKTK ids.csv
xsv slice¶
csvslice is so much slower than xsv that the main reason to use csvslice for common functionality is that you just didn’t get around to installing xsv. lorem ipsum
TODO: section on benchmarks - make this its own include file
xsv slice a single record¶
$ xsv slice -i 2 ids.csv
id,val
2,c
lorem ipsum
$ csvslice -i 2 ids.csv
id,val
2,c
However, performance is much slower! Lorem ipsum TK.
xsv slice the first n records¶
$ xsv slice -l 3 ids.csv
0,a
1,b
2,c
$ csvslice ^-head 3 ids.csv
Note that xsv slice has no tail
-like functionality, i.e. returning the last n rows.
xsv slice a range of records¶
$ xsv slice -s 1 -e 3 ids.csv
id,val
1,b
2,c
lorem csvslice uses inclusive ranges
$ csvslice -i 1-2 ids.csv
id,val
1,b
2,c
The agate library¶
TK Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut
The pandas library¶
TK Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Real-world use cases¶
Skipping the meta-header in Census data TK¶
About the data¶
You can view the data on the Census’s Data site
You can download a copy of the exported zip file: ACSDT1Y2019.B01003_2020-11-10T165412.zip
Given file acs-race.csv
TK use csvflatten to show file, or include screenshot of data in Excel
As you can see the first row of “data” is not really data, but another row of meta header, which we do not need.
TK: Using sed
Keep all rows but the second row, which is unneeded metadata TK
$ sed '2d' acs-pop.csv
"GEO_ID","NAME","B01003_001E","B01003_001M"
"0200000US1","Northeast Region","55982803","*****"
"0200000US2","Midwest Region","68329004","*****"
"0200000US3","South Region","125580448","*****"
"0200000US4","West Region","78347268","*****"
using csvsed
$ csvslice -i '1-' acs-pop.csv
GEO_ID,NAME,B01003_001E,B01003_001M
0200000US1,Northeast Region,55982803,*****
0200000US2,Midwest Region,68329004,*****
0200000US3,South Region,125580448,*****
0200000US4,West Region,78347268,*****