Split a CSV in Windows Powershell.

Posted on Sun 14 August 2016 in 2016

This is truly boring way to kick-start a blog, but at least it sets a comfortably low bar for future posts.

For command-line manipulation of data, Unix really spoils us. Performing simple tasks like splitting a CSV file into several smaller ones is easy, thanks to thorough man pages and, most importantly, a large body of Stack Overflow questions that already cover nearly every use case.

So when I found myself on a Windows machine and having to split a several-thousand-line CSV file into several smaller files of just 680 lines each (for uploading to Google Keyword Planner), I was surprised by how difficult it was to find useful examples online.

First thing's first: don't use the Windows Command Prompt, but go straight to Windows Powershell. It's built in to every modern edition of Windows. To perform this simple task we need a basic loop and a couple of Cmdlets. Cmdlets are the Powershell equivalent of command-line programs on Unix. We can use and chain them together in a similar way, piping output from one to the next, and so on.

To split the file, we need to:

  1. Keep count. We'll need this as we'll want the output files be to numbered: output1.csv, output2.csv, and so on.
  2. Read x lines of the input file at a time. In this case, for Google Keyword Planner, x is 680 because that's the largest number of results it will return per submission. Don't believe them when they say it's 800.
  3. Write out the files.

To read the file, we need the Get-Content Cmdlet, and to write out the results we can use Out-File. The pipe character | behaves in the same way as on Unix, so we can read 680 lines of the input file at a time and pipe them through to Out-File. Set a counter variable first: we'll use i=0. The semicolon ends a statement, much like Bash:

$i=0;

Then we'll read the content, using the -ReadCount flag to pull in just 680 lines:

$i=0; Get-Content C:\Users\StephanS\Documents\example.csv -ReadCount 680

A nice touch in Powershell is that we don't have to actually bother with traditional loop syntax. The -ReadCount flag is smart enough to know that, if the output of Get-Content is piped somewhere, then -ReadCount lines should be sent through the pipeline each time. There's an assumption that you're going to be doing iterative things. So what we've created here is a pipeline, in which 680-line chunks of the file are sent...somewhere. Now we just need to work out what to do with them.

The function that consumes from this pipeline should write out the 680 lines to a new, numbered, file. And of course it should increment the value of i, so that the next 680-line chunk is written out to a new file. Here's how that looks in Powershell:

ForEach-Object {$i++; $_ | Out-File C:\Users\StephanS\Documents\exampleOutput_$i.csv}

Taking that apart: For Each Object we receive, increase the value of i by 1, then take the current object (referred to by the special variable $_, and pipe that through to the Out-File Cmdlet, which works in a pretty self-explanatory way. Of course, note that because we incremented i first, the output files will be numbered starting from the number 1. That's intentional here, but you could reverse the order of the statements in the code block above to start numbering from 0. Now we can put the whole thing together. Instead of ForEach-Object, I'll use the % sign for brevity (% in this context is just an alias for ForEach-Object).

$i=0; Get-Content C:\Users\StephanS\Documents\example.csv -ReadCount 680 | %{$i++; $_ | Out-File C:\Users\StephanS\Documents\exampleOutputName_$i.csv}

Easy! You can do more than this with Powershell, of course, so if your ambitions are loftier than splitting CSV files, check out the docs on Management Cmdlets. Add-Content is another useful one to have in the toolbox for annoying ad-hoc tasks.