Home » Questions » Computers [ Ask a new question ]

How to join two CSV files?

How to join two CSV files?

Suppose you have one CSV file with 2 fields: ID and email. You have another file with 2 fields: email and name. How can you produce a file with all three fields joined on email?

Asked by: Guest | Views: 426
Total answers/comments: 5
bert [Entry]

"Revision3:

You must sort both lists on email alphabetically, then join. Given that the email field the 2nd field of file1 and the 1st field of file2:

sort -t , -k 2,2 file1.csv > sort1.csv
sort -t , -k 1,1 file2.csv > sort2.csv
join -t , -1 2 -2 1 sort1.csv sort2.csv > sort3.csv

parameter meaning

-t , : ',' is the field separator
-k 2,2 : character sort on 2nd field
-k 1,1 : character sort on 1st field
-1 2 : file 1, 2nd field
-2 1 : file 2, 1st field
> : output to file

produces

email,ID,name
email,ID,name
...

sorted by email alphabetically.

Note that if any email is missing from either file it will be omitted from the results."
bert [Entry]

"Perhaps it is overkill, but you could import into a database
(e.g. OpenOffice Base) as two kinds of tables and define a
report that is the desired output.

If the CSV import is a problem, then a spreadsheet program
(e.g. OpenOffice Calc) can do the import. The result can
then easily be transferred to the database."
bert [Entry]

"Try CSV Cruncher.

It takes CSV files as SQL tables and then allows SQL queries, resulting in another CSV or JSON file.

For your case, you would just call:

crunch -in tableA.csv tableB.csv -out output.csv \
""SELECT tableA.id, tableA.email, tableB.name
FROM tableA LEFT JOIN tableB USING (email)""

The tool needs Java 8 or later.

Some of the advantages:

You really get CSV support, not just ""let's assume the data is correct"".
You can join on multiple keys.
Easier to use and understand than join-based solutions.
You can combine more than 2 CSV files.
You can join by SQL expressions - the values don't have to be the same.

Disclaimer: I wrote that tool. It used to be in disarray after Google Code was closed, but I revived it and added new features as I use it."
"Try CSV Cruncher.

It takes CSV files as SQL tables and then allows SQL queries, resulting in another CSV or JSON file.

For your case, you would just call:

crunch -in tableA.csv tableB.csv -out output.csv \
""SELECT tableA.id, tableA.email, tableB.name
FROM tableA LEFT JOIN tableB USING (email)""

The tool needs Java 8 or later.

Some of the advantages:

You really get CSV support, not just ""let's assume the data is correct"".
You can join on multiple keys.
Easier to use and understand than join-based solutions.
You can combine more than 2 CSV files.
You can join by SQL expressions - the values don't have to be the same.

Disclaimer: I wrote that tool. It used to be in disarray after Google Code was closed, but I revived it and added new features as I use it."
bert [Entry]

You could read the CSV file with a spreadsheet program like LibreOffice and use VLOOKUP() macro to search for the name in second file.
bert [Entry]

"In Bash 5.0.3 with GNU Coreutils 8.30 and building off of hyperslug's answer:

If you have unsorted CSV files with duplicate lines and don't want to omit data due to a missing field in a line of either file1.csv or file2.csv, then you can do the following:

Sort file 1 by field 2 and sort file 2 by field 1:

( head -n1 file1.csv && tail -n+2 file1.csv | sort -t, -k2,2 ) > sort1.csv
( head -n1 file2.csv && tail -n+2 file2.csv | sort -t, -k1,1 ) > sort2.csv

Expanding on hyperslug's parameters:

-k 2,2 : character sort starting and stopping on 2nd field
-k 1,1 : character sort starting and stopping on 1st field
head -n1 : read first line
tail -n+1: : read all but first line
( ) : subshell
> : output to file

I had to do head and tail within the subshell ( ) in order to preserve the first header line of the CSV file when sorting by a given field.

Then,

join -t , -a1 -a2 -1 2 -2 1 -o auto sort1.csv sort2.csv > sort3.csv

Expanding on hyperslug's parameters:

-t , : ',' is the field separator
-a1 : Do not omit lines from file 1 if no match in file 2 found
-a2 : Do not omit lines from file 2 if no match in file 1 found.
-1 2 : file 1, 2nd field
-2 1 : file 2, 1st field
-o auto : Auto format: includes extra commas indicating unmatched fields
> : output to file

Here is an example file1.csv, file2.csv, and the resulting sort3.csv:

file1.csv:

ID,email
02,bob@test.com
03,frank@test.com
05,sally@test.com
07,raul@prueba.eu
11,zulu@xi.com

file2.csv:

email,name
tim@test.com,Timothy Brown
bob@test.com,Robert Green
raul@prueba.eu,Raul Vasquez
carol@google.com,Carol Lindsey

sort3.csv:

email,ID,name
bob@test.com,02,Robert Green
carol@google.com,,Carol Lindsey
frank@test.com,03,
raul@prueba.eu,07,Raul Vasquez
sally@test.com,05,
tim@test.com,,Timothy Brown
zulu@xi.com,11,

You can see Timothy Brown and Carol Lindsey lack IDs but are still included in the joined CSV file (with their names and emails in the correct fields)."