Indexing and Sorting a dataframe using iloc and loc


There are multiple ways in pandas by which a dataframe can be indexed i.e, selecting particular set of rows and columns from a dataframe. For a detailed description over this topic, once can refer official pandas documentation - Indexing and Selecting Data

We’ll discuss the following -

Let’s begin with loading a sample dataset and required python packages.

import pandas as pd
import numpy as np
#Fetching data from url as csv by mentioning values of various paramters
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header = None,
                   index_col = False,
                   names = ['sepal_length','sepal_width','petal_length','petal_width','iris_class'])
# visualising first five rows of sample dataset (Iris)
data.head()

The sample first five rows of data looks like (can be viewed using data.head())-

sepal_length sepal_width petal_length petal_width iris_class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
---

Integer based indexing using iloc

To select some fixed no. of column and a fixed no. of rows from this data, one way is to achieve it by using iloc operation. The first part of indexing will be for rows and another will be columns (indexes starting from 0 to total no. of rows/columns).
For example, first 10 rows for last three columns can be extracted by -
to pass a range : can be used while indexing [start:end], start being inclusive and end being exclusive

data.iloc[0:10,2:5]
petal_length petal_width iris_class
0 1.4 0.2 Iris-setosa
1 1.4 0.2 Iris-setosa
2 1.3 0.2 Iris-setosa
3 1.5 0.2 Iris-setosa
4 1.4 0.2 Iris-setosa
5 1.7 0.4 Iris-setosa
6 1.4 0.3 Iris-setosa
7 1.5 0.2 Iris-setosa
8 1.4 0.2 Iris-setosa
9 1.5 0.1 Iris-setosa
---

Similarly, specific rows and columns can be extracted using indexes of the corresponding elements.
The following command, will return 1st and 2nd row of 2nd and 4th column.

data.iloc[[0,1],[1,3]]
sepal_width petal_width
0 3.5 0.2
1 3.0 0.2
---

Another example, to extracting first 10 rows and all columns but first two.

data.iloc[:10,2:]
petal_length petal_width iris_class
0 1.4 0.2 Iris-setosa
1 1.4 0.2 Iris-setosa
2 1.3 0.2 Iris-setosa
3 1.5 0.2 Iris-setosa
4 1.4 0.2 Iris-setosa
5 1.7 0.4 Iris-setosa
6 1.4 0.3 Iris-setosa
7 1.5 0.2 Iris-setosa
8 1.4 0.2 Iris-setosa
9 1.5 0.1 Iris-setosa
---

Labels based indexing using loc

To index a dataframe based on column names, loc can be used.
For example, to get all the columns between petal_length till iris class and records from 2nd to 10th, can be extracted by using -

data.loc[2:10,'petal_length':'iris_class']
petal_length petal_width iris_class
2 1.3 0.2 Iris-setosa
3 1.5 0.2 Iris-setosa
4 1.4 0.2 Iris-setosa
5 1.7 0.4 Iris-setosa
6 1.4 0.3 Iris-setosa
7 1.5 0.2 Iris-setosa
8 1.4 0.2 Iris-setosa
9 1.5 0.1 Iris-setosa
10 1.5 0.2 Iris-setosa
---

Similarly, specific column names can be passed in a list for which we want to fetch the data.

data.loc[2:10,['petal_length','iris_class']]
petal_length iris_class
2 1.3 Iris-setosa
3 1.5 Iris-setosa
4 1.4 Iris-setosa
5 1.7 Iris-setosa
6 1.4 Iris-setosa
7 1.5 Iris-setosa
8 1.4 Iris-setosa
9 1.5 Iris-setosa
10 1.5 Iris-setosa
---

Setting values using loc

.loc can be used for setting values of particular records based on some predefined filter queries.

For example, there is a need to set petal_length of for first 11 records, except first two equals to 30 can be achieved by -
data.loc[2:10,'petal_length'] = 30

# Current state of those records -
data.loc[2:10,['petal_length']]
petal_length
2 1.3
3 1.5
4 1.4
5 1.7
6 1.4
7 1.5
8 1.4
9 1.5
10 1.5

data.loc[2:10,'petal_length'] = 30
data.loc[2:10]
sepal_length sepal_width petal_length petal_width iris_class
2 4.7 3.2 30.0 0.2 Iris-setosa
3 4.6 3.1 30.0 0.2 Iris-setosa
4 5.0 3.6 30.0 0.2 Iris-setosa
5 5.4 3.9 30.0 0.4 Iris-setosa
6 4.6 3.4 30.0 0.3 Iris-setosa
7 5.0 3.4 30.0 0.2 Iris-setosa
8 4.4 2.9 30.0 0.2 Iris-setosa
9 4.9 3.1 30.0 0.1 Iris-setosa
10 5.4 3.7 30.0 0.2 Iris-setosa
---

Another example of using .loc for setting values, make all records equal to null where ever sepal_width is greater than 3

# current state of data
data.head()
sepal_length sepal_width petal_length petal_width iris_class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 30.0 0.2 Iris-setosa
3 4.6 3.1 30.0 0.2 Iris-setosa
4 5.0 3.6 30.0 0.2 Iris-setosa

data.loc[data['sepal_width']>3] = np.nan
data.head()
sepal_length sepal_width petal_length petal_width iris_class
0 NaN NaN NaN NaN NaN
1 4.9 3.0 1.4 0.2 Iris-setosa
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
---

Sorting

Sorting a dataframe in python can be done in multiple ways. We’ll be looking two of those here -

  • sort_values
  • argsort

Let’s begin with reloading the dataset again and have a look at first six rows of sepal_width column.

#Fetching data from url as csv by mentioning values of various paramters
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                   header = None,
                   index_col = False,
                   names = ['sepal_length','sepal_width','petal_length','petal_width','iris_class'])
sample_data = data.iloc[0:6,[1]]
sample_data
sepal_width
0 3.5
1 3.0
2 3.2
3 3.1
4 3.6
5 3.9
---

Using sort_values

Simples way to sort a dataframe can be done using sort_values function of pandas dataframe, which take the column name argument on which the sorting is to be done.

sample_data.sort_values(by='sepal_width')
sepal_width
1 3.0
3 3.1
2 3.2
0 3.5
4 3.6
5 3.9

Looking at the indexes of the sorted values, gives us the following results.

sample_data.sort_values(by='sepal_width').index.values

array([1, 3, 2, 0, 4, 5], dtype=int64)


Using argsort

Another way to sort a dataframe can be acheived by using argsort, which basically returns the list of indexes which will sort the values. That list of indexes can be passed to .iloc indexing and the output will return an sorted column.
To read more about argsort, please follow - https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

sample_data['sepal_width'].values.argsort(axis=0)

array([1, 3, 2, 0, 4, 5], dtype=int64)

sample_data.iloc[sample_data['sepal_width'].values.argsort(axis=0)]
sepal_width
1 3.0
3 3.1
2 3.2
0 3.5
4 3.6
5 3.9
---

Notebook Link - Indexing and Sorting a dataframe using iloc and loc{:target="_blank"}


Related Posts

Read data into pandas

December 23, 2018

Read More
Connect to azure storage (blob) using python

December 22, 2018

Read More
Connect to azure datalake store using python

December 20, 2018

Read More