Suppose you have one CSV file with 2 fields: ID and email. You have another file with 2 fields: email and name. How can you produce a file with all three fields joined on email?
-t , : ',' is the field separator -k 2,2 : character sort on 2nd field -k 1,1 : character sort on 1st field -1 2 : file 1, 2nd field -2 1 : file 2, 1st field > : output to file
produces
email,ID,name email,ID,name ...
sorted by email alphabetically.
Note that if any email is missing from either file it will be omitted from the results."
"Perhaps it is overkill, but you could import into a database (e.g. OpenOffice Base) as two kinds of tables and define a report that is the desired output.
If the CSV import is a problem, then a spreadsheet program (e.g. OpenOffice Calc) can do the import. The result can then easily be transferred to the database."
It takes CSV files as SQL tables and then allows SQL queries, resulting in another CSV or JSON file.
For your case, you would just call:
crunch -in tableA.csv tableB.csv -out output.csv \ ""SELECT tableA.id, tableA.email, tableB.name FROM tableA LEFT JOIN tableB USING (email)""
The tool needs Java 8 or later.
Some of the advantages:
You really get CSV support, not just ""let's assume the data is correct"". You can join on multiple keys. Easier to use and understand than join-based solutions. You can combine more than 2 CSV files. You can join by SQL expressions - the values don't have to be the same.
Disclaimer: I wrote that tool. It used to be in disarray after Google Code was closed, but I revived it and added new features as I use it." "Try CSV Cruncher.
It takes CSV files as SQL tables and then allows SQL queries, resulting in another CSV or JSON file.
For your case, you would just call:
crunch -in tableA.csv tableB.csv -out output.csv \ ""SELECT tableA.id, tableA.email, tableB.name FROM tableA LEFT JOIN tableB USING (email)""
The tool needs Java 8 or later.
Some of the advantages:
You really get CSV support, not just ""let's assume the data is correct"". You can join on multiple keys. Easier to use and understand than join-based solutions. You can combine more than 2 CSV files. You can join by SQL expressions - the values don't have to be the same.
Disclaimer: I wrote that tool. It used to be in disarray after Google Code was closed, but I revived it and added new features as I use it."
"In Bash 5.0.3 with GNU Coreutils 8.30 and building off of hyperslug's answer:
If you have unsorted CSV files with duplicate lines and don't want to omit data due to a missing field in a line of either file1.csv or file2.csv, then you can do the following:
Sort file 1 by field 2 and sort file 2 by field 1:
-k 2,2 : character sort starting and stopping on 2nd field -k 1,1 : character sort starting and stopping on 1st field head -n1 : read first line tail -n+1: : read all but first line ( ) : subshell > : output to file
I had to do head and tail within the subshell ( ) in order to preserve the first header line of the CSV file when sorting by a given field.
-t , : ',' is the field separator -a1 : Do not omit lines from file 1 if no match in file 2 found -a2 : Do not omit lines from file 2 if no match in file 1 found. -1 2 : file 1, 2nd field -2 1 : file 2, 1st field -o auto : Auto format: includes extra commas indicating unmatched fields > : output to file
Here is an example file1.csv, file2.csv, and the resulting sort3.csv:
email,name tim@test.com,Timothy Brown bob@test.com,Robert Green raul@prueba.eu,Raul Vasquez carol@google.com,Carol Lindsey
sort3.csv:
email,ID,name bob@test.com,02,Robert Green carol@google.com,,Carol Lindsey frank@test.com,03, raul@prueba.eu,07,Raul Vasquez sally@test.com,05, tim@test.com,,Timothy Brown zulu@xi.com,11,
You can see Timothy Brown and Carol Lindsey lack IDs but are still included in the joined CSV file (with their names and emails in the correct fields)."