csvslice

csvslice is a command for selecting rows by 0-based index and/or inclusive ranges.

Given a data file, ids.csv:

$ csvslice -i 0,2-3 ids.csv

The output:

source
id,val
a,0
c,2
d,3

Options and flags

csvslice has 3 unique options, --indexes, --head, and --tail. Each of these specify a mode, and one and only one mode can be specified.

-i, --indexes <values>

<values> is a comma-delimited list of values representing individual indexes or ranges of indexes to be sliced and included from the 0-indexed dataset.

An index value can take these forms:

  • An individual row: -i 0 returns the very first row
  • A range of rows: -i 2-4 returns rows in the inclusive index range of 2 through 4
  • An open-ended range: -i 3- returns all rows starting from row index 3

Multiple interval values can be passed into -i/--indexes, e.g.

csvslice -i '0-5,12,15-18,42-' data.csv

--head <int>

Return the header and the first <int> rows; <int> must be greater than 0.

--tail <int>

Return the header and the last <int> rows; <int> must be greater than 0.

Usage overview and examples

These examples refer to data as found in ids.csv

Get the first n rows with --head

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolor

Get the last n rows with --tail

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Slicing individual rows with --index

You can specify rows to be sliced by 0-based index:

csvslice -i 1 ids.csv
id,val
1,b

You can also specify a series of individual indexes as a comma-delimited string:

csvslice -i 0,5 ids.csv
id,val
0,a
5,f

Slicing rows by an index range

Rows can be specified by using a range syntax: start-end

The range is inclusive:

$ csvslice -i 1-3 ids.csv
id,val
1,b
2,c
3,d

Omitting the right-side end value returns an open range of values:

$ csvslice -i 3- ids.csv
id,val
3,d
4,e
5,f

Like indexes, a series of ranges can be specified as a comma-delimited string:

$ csvslice -i 0-1,3- ids.csv
id,val
0,a
1,b
3,d
4,e
5,f

And you can combine ranges with individual indexes:

$ csvslice -i 0,2-3,5 ids.csv
id,val
0,a
2,c
3,d
5,f

Troubleshooting

Even though 3-1 and is technically a valid range, csvslice will throw an error if the end value is smaller than the start value:

$ csvslice -i 3-1 examples/ids.csv
InvalidRange: Invalid range specified: 3-1

For the most part, though, csvslice isn’t too whiny (in terms of warning you) about messy or otherwise nonsensical values for -i/--indexes.

No matter what order you specify the indexes and ranges, it will always return rows in sequential order:

$ csvslice -i 4,0,2 ids.csv
id,val
0,a
2,c
4,e
$ csvslice -i 4,0-2,3 ids.csv
id,val
0,a
1,b
2,c
3,d
4,e

If you pass in repeated indexes and/or overlapping ranges, csvslice will still only return the original, sequential data, i.e. it will not return duplicates of rows:

$ csvslice -i 3,1,3,1,1 ids.csv
id,val
1,b
3,d
$ csvslice -i 1,0-2,1-3 ids.csv
id,val
0,a
1,b
2,c
3,d

If you pass in references to non-existent row indexes, such as out-of-bounds numbers — those too are ignored:

csvslice -i 5,42 ids.csv
id,val
5,f

How csvslice compares to existing tools

Given ids.csv (TK: not sure I need to repeat this reference?)

head: Get the first n rows

The issue with head is that it only understands text and newline characters. Thus, specifying -n 3 returns 3 lines, one of which is the header line.

$ head -n 3 ids.csv
id,val
0,a
1,b

In contrast, csvslice has a notion of comma-delimited data, in which the first line is not data, but the header. Thus, an argument equivalent to -n 3 would return 4 lines in total: 1 header line and 3 data lines.

$ csvslice --head 3 ids.csv
id,val
0,a
1,b
2,c

tail: Get the last n rows

TK lorem with tail

$ tail -n 2 ids.csv
4,e
5,f

TK lorem with csvslice

$ csvslice --tail 2 ids.csv
id,val
4,e
5,f

xsv slice

csvslice is so much slower than xsv that the main reason to use csvslice for common functionality is that you just didn’t get around to installing xsv. lorem ipsum

TODO: section on benchmarks - make this its own include file

xsv slice a single record

$ xsv slice -i 2 ids.csv
id,val
2,c

lorem ipsum

$ csvslice -i 2 ids.csv
id,val
2,c

However, performance is much slower! Lorem ipsum TK.

xsv slice the first n records

$ xsv slice -l 3 ids.csv
0,a
1,b
2,c
$ csvslice ^-head 3 ids.csv

Note that xsv slice has no tail-like functionality, i.e. returning the last n rows.

xsv slice a range of records

$ xsv slice -s 1 -e 3 ids.csv
id,val
1,b
2,c

lorem csvslice uses inclusive ranges

$ csvslice -i 1-2 ids.csv
id,val
1,b
2,c

The agate library

TK Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut

The pandas library

TK Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Real-world use cases

Skipping the meta-header in Census data TK

About the data

You can view the data on the Census’s Data site

You can download a copy of the exported zip file: ACSDT1Y2019.B01003_2020-11-10T165412.zip

Given file acs-race.csv

TK use csvflatten to show file, or include screenshot of data in Excel

As you can see the first row of “data” is not really data, but another row of meta header, which we do not need.

TK: Using sed

Keep all rows but the second row, which is unneeded metadata TK

$ sed '2d' acs-pop.csv
"GEO_ID","NAME","B01003_001E","B01003_001M"
"0200000US1","Northeast Region","55982803","*****"
"0200000US2","Midwest Region","68329004","*****"
"0200000US3","South Region","125580448","*****"
"0200000US4","West Region","78347268","*****"

using csvsed

$ csvslice -i '1-' acs-pop.csv
GEO_ID,NAME,B01003_001E,B01003_001M
0200000US1,Northeast Region,55982803,*****
0200000US2,Midwest Region,68329004,*****
0200000US3,South Region,125580448,*****
0200000US4,West Region,78347268,*****