Pandas read_csv complete guide – Machine Learning HD


hhhhm

December 8, 2023


read_csv is a pandas function that reads data from CSV (comma-separated values) files into a DataFrame, whether you work in a Jupyter notebook, an IPython notebook or plain Python. It's quite intuitive to use when it works, but when it fails it can be quite problematic. In this post, you'll learn the following:

Importing CSV files with the help of the read_csv function in pandas
What is a CSV file?
Difference between CSV and text file
Difference between CSV and XLS
Pros and cons of CSV files
Common errors while importing a CSV file through pandas

Importing CSV files in pandas

Reading a CSV file in pandas is pretty straightforward. You use the pandas read_csv function and voilà, you can read the file. Here is the code:

# Import the pandas package
import pandas as pd
# Read the CSV file using pandas
wine = pd.read_csv('wine.csv')

If you are able to read your file, good news: move on to EDA or analysis in Python. If you are one of the unlucky few who get an error, you are stuck with a file that can't be read in Python.

This is a common problem, and you will want to solve it. To do so, we first need to understand some basics of comma-separated values files and then apply those concepts to troubleshoot the problem.

What is a CSV file?

CSV, also known as comma-separated values, is a delimited text file that uses commas to separate values. In a CSV file, each line of the file is a separate data record. It typically contains tabular data, mainly numbers and text. Each line in a CSV file has the same number of fields.
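As a small illustration of this structure, the sketch below builds a three-line CSV in memory and parses it. The column names and values are made up for the example; the point is only that each text line becomes one record with the same number of fields.

```python
import io

import pandas as pd

# Hypothetical CSV content: one header line plus two data records
csv_text = "name,quantity,price\napple,3,0.5\npear,10,0.25\n"

# io.StringIO lets read_csv treat the string like an open file
sample = pd.read_csv(io.StringIO(csv_text))

# Two records, three fields each
n_records, n_fields = sample.shape
print(sample)
```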

Here is an example of a CSV file in Microsoft Excel and Notepad.

One reason for the popularity of CSV files is their simplicity and flexibility; the second is the wide range of common software that supports them. For example, CSV is used in Microsoft Excel, Apple Numbers, Google Sheets, OpenOffice, LibreOffice and most text editors such as WordPad, Notepad, etc.

What is the difference between a CSV file and a TSV file?

The biggest difference between CSV files and TSV files is the way the data is stored. In CSV files the values are separated by commas, whereas in TSV files they are separated by tabs ("\t").
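A minimal sketch of that difference, using made-up in-memory data: the same table is written once with commas and once with tabs, and only the sep argument changes when reading it back.

```python
import io

import pandas as pd

# Hypothetical data; the TSV text is the CSV text with commas swapped for tabs
csv_text = "city,temp\nOslo,4\nCairo,30\n"
tsv_text = csv_text.replace(",", "\t")

from_csv = pd.read_csv(io.StringIO(csv_text))            # default sep=","
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-separated

# Both reads should produce the same table
same_table = from_csv.equals(from_tsv)
```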


What is the difference between a CSV file and a text file?

While both are text files, the major difference is that a text file with the .txt extension is a generic type, which only indicates that the content is human readable. A CSV file is specifically a "comma separated values" file, so its content follows that tabular layout.

What is the difference between a CSV file and an XLS file?

Files with the XLS extension are binary spreadsheet files for Microsoft Excel, which contain numbers and text. Unlike CSV files, they are not plain text.

Advantages and disadvantages of CSV files

Advantages	Disadvantages
Human readable and easy to edit	Handles mostly basic data
Simple to implement and parse	Complex configurations cannot be imported and exported
Processed by most software	No distinction between text and numbers
Simple schema	No standard way to represent binary data
Compact and easy to generate	Poor or no support for special characters


Common errors while using pandas read_csv

1) FileNotFoundError

This error means pandas was not able to find the file in the working directory, which is where your Jupyter notebook is located. For example, on Windows your notebook might be at C:/documents/folder1/jupyternotebook.ipynb.

If the file is not available in that folder, which is C:/documents/folder1, then you have to explicitly provide the path to the file, for example C:/file_name.csv.

This concept is the same on Mac and Linux operating systems.
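One way to make this failure easier to diagnose is to check the path before reading and report the directory that relative paths are resolved against. The helper name load_csv below is hypothetical, not part of pandas; treat it as a sketch.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV, reporting where pandas looked if the file is missing.

    Hypothetical helper for illustration only.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Relative paths are resolved against the current working directory
        raise FileNotFoundError(
            f"{csv_path} not found; relative paths are resolved against {Path.cwd()}"
        )
    return pd.read_csv(csv_path)
```

Calling load_csv('wine.csv') from a notebook then either returns the DataFrame or tells you which folder to put the file in.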

2) pandas.parser.CParserError: Error tokenizing data. C Error: Expected xx fields in line yy, saw zz

There are a couple of things you can try to solve this problem. (Recent pandas versions report the same error as pandas.errors.ParserError.)

2.1) Try using the code below

# pandas 1.3 and later; older versions used error_bad_lines=False instead
df = pd.read_csv('datafile.csv', on_bad_lines='skip')

Do note that the bad lines causing the error will be skipped if you use this method.

2.2) Another method

# 'delimiter' is a placeholder: replace it with the separator your file actually uses
df = pd.read_csv('datafile.csv', sep='delimiter', header=None)

2.3) Another method

df = pd.read_csv('datafile.csv',skiprows = xx)

Again, with this method the first xx rows will be skipped.

2.4) The file may actually be tab-separated (TSV), so pass the tab separator explicitly

data = pd.read_csv("File_path", sep='\t')

2.5) Another method

data = pd.read_csv(filename, delimiter=",", encoding='utf-8')

2.6) Another method

df = pd.read_csv('file.csv', encoding='utf8', engine="python")

You can also try combining everything in one go.
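To see the tokenizing error and the skip-bad-lines workaround side by side, here is a small sketch on made-up in-memory data. It assumes pandas 1.3 or later, where the on_bad_lines parameter is available.

```python
import io

import pandas as pd

# Hypothetical data: the second data row has one field too many
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# Default behaviour: the malformed row raises a ParserError
try:
    pd.read_csv(io.StringIO(raw))
    parse_failed = False
except pd.errors.ParserError as exc:
    parse_failed = True
    print(f"parse failed: {exc}")

# Workaround: drop rows that have too many fields (assumes pandas >= 1.3)
df_skipped = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
```

The skipped row is silently lost, so it is worth comparing the row count with the line count of the file before relying on the result.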

3) TypeError: data type "datetime" not understood

This error occurs when "datetime" is passed as a dtype: CSV files do not have a datetime data type, and pandas infers the data only as strings, integers and floats. To solve this, read the date columns as strings and let parse_dates convert them, as in the following code:

headers = ['col1', 'col2', 'col3', 'col4']
dtypes = {'col1': 'str', 'col2': 'str', 'col3': 'str', 'col4': 'float'}

parse_dates = ['col1', 'col2']

df = pd.read_csv(file, sep='\t', header=None, names=headers, dtype=dtypes, parse_dates=parse_dates)
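The same idea on a self-contained example: the column names and values below are invented for illustration, and the dates are in ISO format so that pandas can parse them without extra hints.

```python
import io

import pandas as pd

# Hypothetical headerless data: two date columns, a name and a price
raw = "2023-01-05,2023-02-01,widget,9.5\n2023-03-10,2023-04-02,gadget,12.0\n"
headers = ["start", "end", "name", "price"]

df_dates = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=headers,
    dtype={"name": "str", "price": "float"},  # no 'datetime' entry here
    parse_dates=["start", "end"],             # converted to datetime64 instead
)

start_is_datetime = pd.api.types.is_datetime64_any_dtype(df_dates["start"])
```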


4) Want to read multiple CSV files in one go

For a few files, generally 10-15 files, this code should work

df = pd.concat(map(pd.read_csv, ['data/d1.csv', 'data/d2.csv','data/d3.csv']))

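For more files, try this code

from os import listdir
from os.path import join

# listdir returns bare file names, so join each one with its directory
filepaths = [join("./data", f) for f in listdir("./data") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))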
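An alternative sketch uses pathlib, which returns full paths directly. The function name read_all_csvs is hypothetical, and the usage part writes three throwaway files to a temporary folder so the example can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_all_csvs(folder):
    """Concatenate every .csv file in `folder` into one DataFrame.

    Hypothetical helper for illustration only.
    """
    paths = sorted(Path(folder).glob("*.csv"))  # full paths, in name order
    if not paths:
        raise ValueError(f"no .csv files found in {folder}")
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)


# Usage with throwaway files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        Path(tmp, f"d{i}.csv").write_text(f"id,value\n{i},{i * 10}\n")
    combined = read_all_csvs(tmp)
```

Sorting the paths keeps the row order reproducible, and ignore_index=True avoids duplicate index labels in the combined frame.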

5) MemoryError:

This error generally occurs when you are reading files that are large compared with the memory (RAM) available on your computer. There is no single clear way to solve it, but you can try the following:

5.1) Explicitly specify dtypes

import numpy as np
import pandas as pd

df_dtype = {
        "column_1": int,
        "column_2": str,
        "column_3": np.int16,
        "column_4": np.uint8,
        # ... one entry per remaining column ...
        "column_n": np.float32
}

df = pd.read_csv('path/to/file', dtype=df_dtype)

The main reason this helps is that pandas stores each column in a fixed-width type that maps to a C data type, so a smaller type uses less memory per value. The value ranges of the common C data types are:

The maximum value of UNSIGNED CHAR = 255
The minimum value of SHORT INT = -32768
The maximum value of SHORT INT = 32767
The minimum value of INT = -2147483648
The maximum value of INT = 2147483647
The minimum value of CHAR = -128
The maximum value of CHAR = 127
The minimum value of LONG = -9223372036854775808
The maximum value of LONG = 9223372036854775807
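To get a feel for the saving, the sketch below reads the same made-up column twice, once with the default integer type and once as np.uint8, and compares the memory each version uses. The values are kept below 256 so that they fit in an unsigned 8-bit integer.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical column of 1000 small integers (all between 0 and 199)
csv_text = "score\n" + "\n".join(str(i % 200) for i in range(1000))

default_df = pd.read_csv(io.StringIO(csv_text))  # pandas picks a 64-bit integer
compact_df = pd.read_csv(io.StringIO(csv_text), dtype={"score": np.uint8})

# Bytes used by the column data alone (index excluded)
default_bytes = default_df["score"].memory_usage(index=False)
compact_bytes = compact_df["score"].memory_usage(index=False)
print(default_bytes, compact_bytes)
```

Choosing a type that is too small for the data makes the read fail or lose information, so check the actual value range of each column first.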

5.2) Read the file in chunks

import pandas as pd

tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)
# gives a TextFileReader, which is iterable in chunks of 1000 rows.

df = pd.concat(tp, ignore_index=True)

# df is a DataFrame. If this raises errors, use `list(tp)` instead of `tp`
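Concatenating every chunk still ends up holding the whole table in memory. When only an aggregate is needed, a possible variation is to process each chunk and keep just the running result, as in this sketch on made-up data:

```python
import io

import pandas as pd

# Hypothetical single-column file with the values 0..9
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
n_chunks = 0
# chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    total += int(chunk["value"].sum())  # keep only the running sum
    n_chunks += 1

print(total, n_chunks)
```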

6) error: Integer column has NA values

This error generally occurs when you have null values in your CSV file in a column that pandas is asked to read as a plain integer type, which cannot hold missing values. Use the nullable integer type 'Int64' (with a capital I) instead:

df = pd.read_csv('file.csv', dtype={'id': 'Int64'})
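A small sketch of both outcomes on made-up data with one missing id: requesting the plain int64 type fails, while the nullable Int64 type keeps the gap as a missing value.

```python
import io

import pandas as pd

# Hypothetical data: the second record has no id
csv_text = "id,name\n1,a\n,b\n3,c\n"

# Plain integer dtype cannot represent the missing value
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"id": "int64"})
    plain_int_failed = False
except ValueError as exc:
    plain_int_failed = True
    print(f"plain int64 failed: {exc}")

# Nullable integer dtype stores the gap as <NA>
nullable = pd.read_csv(io.StringIO(csv_text), dtype={"id": "Int64"})
n_missing = int(nullable["id"].isna().sum())
```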

7) UnicodeDecodeError: 'utf-8' codec can't decode byte xxxx in position yy: invalid start byte

Search for the encodings commonly used on your operating system. Try passing one of these to read_csv in pandas: encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'.
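If you don't know the encoding in advance, one possible approach is to try a short list of candidates in order. The helper read_with_fallback below is hypothetical and works on raw bytes so the example can run without a file. Note that latin1 can decode any byte sequence, so it never fails and should come last; a successful read does not prove the text was decoded correctly.

```python
import io

import pandas as pd


def read_with_fallback(raw_bytes, encodings=("utf-8", "cp1252", "latin1")):
    """Return (DataFrame, encoding) for the first encoding that decodes.

    Hypothetical helper for illustration only.
    """
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up data saved as cp1252: the accented letter is not valid UTF-8
raw_bytes = "name\ncafé\n".encode("cp1252")
decoded_df, used_encoding = read_with_fallback(raw_bytes)
```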

8) OSError: Initializing from file failed

This generally occurs if you do not have read permissions for the file. You can also try this code:

pd.read_csv('file1.csv', engine="python")

In summary, we have discussed the pandas read_csv function in detail. We have also discussed:

what a CSV file is and how it is structured
the difference between a CSV file, a TSV file and a text file
advantages and disadvantages of CSV files
common errors related to the pandas read_csv function
