Essential Python Libraries for Data Manipulation

Image generated with Midjourney

As a data professional, it’s essential to understand how to process your data. In the modern era, that means using a programming language to quickly manipulate datasets and achieve the expected results.

Python is the most popular programming language among data professionals, and many of its libraries are helpful for data manipulation. From simple vector operations to parallelization, there is a library for each use case.

So, which Python libraries are essential for data manipulation? Let’s get into it.

1. NumPy

The first library to discuss is NumPy. NumPy is an open-source library for scientific computing. It was first released in 2005 and has been used in many data science applications since.

NumPy is a popular library, providing many valuable features for scientific computing, such as array objects, vectorized operations, and mathematical functions. Many data science use cases rely on complex table and matrix calculations, and NumPy simplifies that calculation process.

Let’s try NumPy with Python. Many data science platforms, such as Anaconda, have NumPy installed by default, but you can always install it via pip.

pip install numpy

After the installation, we will create a simple array and perform array operations.

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = a + b
print(c)

Output: [5 7 9]

We can also perform basic statistics calculations with NumPy.

data = np.array([1, 2, 3, 4, 5, 6, 7])
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
print(f"The data mean:{mean}, median:{median} and standard deviation: {std_dev}")

Output: The data mean:4.0, median:4.0 and standard deviation: 2.0

It’s also possible to perform linear algebra operations such as matrix calculation.

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
dot_product = np.dot(x, y)
print(dot_product)

Output:

[[19 22]
[43 50]]

There is so much more you can do with NumPy. From handling data to complex calculations, it’s no wonder many libraries use NumPy as their base.

2. Pandas

Pandas is the most popular Python data manipulation library among data professionals, and many data science courses use Pandas as the basis for all subsequent work.

Pandas is famous because it provides an intuitive yet versatile API, so many data manipulation problems can easily be solved with the library. Pandas allows the user to perform data operations and analyze data from various input formats such as CSV, Excel, SQL databases, or JSON.
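As a minimal sketch of that flexibility, here is how loading a few of those formats could look (the file names below are only illustrative, not files from this article):

import pandas as pd

# Hypothetical file names for illustration; each reader returns a DataFrame.
df_csv = pd.read_csv("sales.csv")
df_excel = pd.read_excel("sales.xlsx")
df_json = pd.read_json("sales.json")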

Pandas is built on top of NumPy, so NumPy object properties still apply to any Pandas object.

Let’s try out the library. Like NumPy, it’s usually available by default if you are using a data science platform such as Anaconda. However, you can follow the Pandas installation guide if you are unsure.

You can initiate a dataset from NumPy objects and get a DataFrame object (a table-like structure) that shows the top five rows of data with the following code.

import numpy as np
import pandas as pd

np.random.seed(0)
months = pd.date_range(start='2023-01-01', periods=12, freq='M')
sales = np.random.randint(10000, 50000, size=12)
transactions = np.random.randint(50, 200, size=12)
data = {
    'Month': months,
    'Sales': sales,
    'Transactions': transactions
}
df = pd.DataFrame(data)
df.head()

Then you can try several data manipulation activities, such as data selection.

df[df['Transactions'] < 100]

It’s also possible to perform data calculations.

total_sales = df['Sales'].sum()
average_transactions = df['Transactions'].mean()

Performing data cleaning with Pandas is also easy.

df = df.dropna()           # drop rows that contain missing values
df = df.fillna(df.mean())  # or fill missing values with the column mean

There is so much you can do with Pandas for data manipulation. Check out Bala Priya's article on using Pandas for data manipulation to learn more.

3. Polars

Polars is a relatively new data manipulation Python library designed for the swift analysis of large datasets. Polars boasts up to 30x performance gains compared to Pandas in several benchmark tests.

Polars is built on top of Apache Arrow, so it manages memory efficiently for large datasets and allows for parallel processing. It also optimizes data manipulation performance using lazy execution, which delays computation until it’s necessary.
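As a minimal sketch of lazy execution (assuming Polars is already installed and a hypothetical sales.csv file exists), the query below is only planned until collect() is called:

import polars as pl

# scan_csv builds a query plan without reading the whole file into memory
lazy_query = (
    pl.scan_csv("sales.csv")
      .filter(pl.col("Sales") > 20000)
      .select(["Month", "Sales"])
)
result = lazy_query.collect()  # computation happens only here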

For the Polars installation, you can use the following code.

pip install polars 

Like Pandas, you can initiate the Polars DataFrame with the following code.

import numpy as np
import polars as pl

np.random.seed(0)
employee_ids = np.arange(1, 101)
ages = np.random.randint(20, 60, size=100)
salaries = np.random.randint(30000, 100000, size=100)
df = pl.DataFrame({
    'EmployeeID': employee_ids,
    'Age': ages,
    'Salary': salaries
})
df.head()

However, there are differences in how we use Polars to manipulate data. For example, here is how we select data with Polars.

df.filter(pl.col('Age') > 40)
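Aggregations follow the same expression-based style. For example, a minimal sketch summarizing the DataFrame above could look like this:

summary = df.select([
    pl.col('Salary').mean().alias('avg_salary'),
    pl.col('Age').max().alias('max_age')
])
print(summary)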

The API is considerably more complex than Pandas', but it’s helpful if you require fast execution on large datasets. On the other hand, you won't see much benefit if the data size is small.

For more details, you can refer to Josep Ferrer's article on how Polars compares to Pandas.

4. Vaex

Vaex is similar to Polars in that the library was developed specifically for manipulating very large datasets. However, there are differences in how they process data: Vaex utilizes memory-mapping techniques, while Polars focuses on a multi-threaded approach.

Vaex is best suited for datasets far larger than those Polars is intended for. While Polars is also designed for extensive dataset manipulation, it works best on datasets that still fit into memory, whereas Vaex shines on datasets that exceed available memory.

For the Vaex installation, it’s better to refer to their documentation, as installing it incorrectly could break your environment.
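Once installed, a minimal sketch of the Vaex API looks like the following (the column names and arrays here are only illustrative):

import numpy as np
import vaex

# Build a Vaex DataFrame from in-memory arrays; for real workloads you would
# typically open a memory-mapped file instead, e.g. vaex.open('data.hdf5').
x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
df = vaex.from_arrays(x=x, y=y)

# Filtering creates a lazy view; statistics are computed out-of-core.
subset = df[df.x > 0.5]
print(subset.mean(subset.y))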

5. CuPy

CuPy is an open-source library that enables GPU-accelerated computing in Python. CuPy was designed as a replacement for NumPy and SciPy when you need to run calculations on NVIDIA CUDA or AMD ROCm platforms.

This makes CuPy great for applications that require intense numerical computation and GPU acceleration. CuPy can utilize the parallel architecture of the GPU, which is beneficial for large-scale computations.

To install CuPy, refer to their GitHub repository, as the available versions may or may not suit the platform you use. For example, the command below is for the CUDA platform.

pip install cupy-cuda11x

The API is similar to NumPy's, so you can use CuPy almost instantly if you are already familiar with NumPy. For example, a CuPy calculation looks like this:

import cupy as cp

x = cp.arange(10)
y = cp.array([2] * 10)
z = x * y
print(cp.asnumpy(z))
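Because the APIs mirror each other, a common pattern is to write array code once and let it run on either the CPU or the GPU. Here is a minimal sketch of that idea using cp.get_array_module; the standardize function is just an illustration, not part of CuPy:

import numpy as np
import cupy as cp

def standardize(arr):
    # get_array_module returns numpy or cupy depending on where arr lives,
    # so the same function works on CPU and GPU arrays.
    xp = cp.get_array_module(arr)
    return (arr - xp.mean(arr)) / xp.std(arr)

cpu_result = standardize(np.random.rand(10))
gpu_result = standardize(cp.random.rand(10))
print(cp.asnumpy(gpu_result))  # move the result back to host memory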

CuPy rounds out this list of essential Python libraries if you are continuously working with large-scale computational data.

Conclusion


All the Python libraries we have explored are essential in certain use cases. NumPy and Pandas might be the basics, but libraries like Polars, Vaex, and CuPy would be beneficial in specific environments.

If you have any other library you deem essential, please share them in the comments!

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.


More On This Topic

  • 8 Best Python Image Manipulation Tools
  • 10 Pandas One Liners for Data Access, Manipulation, and Management
  • 5 Must Try Awesome Python Data Visualization Libraries
  • Python Libraries Data Scientists Should Know in 2022
  • Introduction to Python Libraries for Data Cleaning