If you’re preparing for a job as a data analyst or data engineer, Pandas and NumPy are two tools you must know really well. These Python libraries help you work with data efficiently, whether you’re cleaning, transforming, or analyzing huge datasets.
Interviewers often ask practical questions that test how well you can handle real-world problems using these libraries.
In this guide, we have shared a wide range of Pandas and NumPy interview questions. These Q&As will help you get ready for your next technical interview with confidence.
What is Pandas in Python?
Pandas is an open-source Python library that helps organize data in rows and columns, just like Excel but much more powerful. It can read files, sort information, find patterns in numbers, and create charts.
The jobs that use Pandas most include data scientists, data analysts, business analysts, research scientists, financial analysts, and marketing analysts.
What is NumPy?
NumPy helps computers work with lots of numbers at the same time, kind of like having a magic math box that can solve thousands of math problems instantly. Instead of doing math problems one by one, NumPy can handle huge lists of numbers all at once, making everything much faster.
The jobs that use NumPy most include data scientists, engineers, researchers, machine learning specialists, financial analysts, and scientists who work with lots of calculations.
Pandas Interview Questions and Answers
Que 1. How would you handle a very large CSV file (5GB+) using Pandas without running into memory issues?
Answer:
To manage large files, use the chunksize parameter in read_csv() to load the data in parts. Process each chunk iteratively to avoid loading the entire dataset into memory.
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
# process each chunk
Also, consider passing dtype to reduce memory usage and usecols to load only the required columns.
Que 2. Explain the difference between loc[], iloc[], and at[] in Pandas with examples.
Answer:
loc[] is label-based indexing.
iloc[] is integer-position based indexing.
at[] is optimized for accessing a single scalar value via label (faster than loc[]).
df.loc[2, 'name'] # label-based
df.iloc[2, 1] # position-based
df.at[2, 'name'] # fast scalar access
Que 3. What are the performance considerations when applying functions to Pandas DataFrames?
Answer:
Avoid using apply() with a lambda for row-wise operations; it is slow. Prefer vectorized operations or NumPy functions for better performance. If needed, use Cython, Numba, or convert with DataFrame.to_numpy() for faster computation.
Que 4. How can you merge two DataFrames with different keys, and ensure no row loss?
Answer:
Use merge() with how='outer' to retain all rows from both DataFrames, regardless of matching keys:
pd.merge(df1, df2, how='outer', left_on='id1', right_on='id2')
Que 5. Describe a scenario where groupby() can be misused or misunderstood.
Answer:
Misuse often happens when forgetting to reset the index after groupby().agg(), causing unexpected results in downstream merges or operations. Also, using groupby on high-cardinality columns can degrade performance.
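A minimal sketch of the reset_index pitfall (column names are illustrative):
import pandas as pd
df = pd.DataFrame({'city': ['NY', 'NY', 'LA'], 'sales': [10, 20, 30]})
agg = df.groupby('city').agg({'sales': 'sum'})  # 'city' moves into the index
agg = agg.reset_index()  # restore 'city' as a column before any merge on it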
Que 6. How would you find and fill missing values in time series data while preserving trends?
Answer:
Use forward fill (ffill) or backward fill (bfill) with caution. Better approaches include interpolation:
df['value'].interpolate(method='time')
This keeps the temporal trend intact, especially useful in stock or IoT data.
Que 7. How do you handle multi-indexed DataFrames during analysis or export?
Answer:
Flatten the multi-index using:
df.reset_index()
Or combine the levels:
df.columns = ['_'.join(col).strip() for col in df.columns.values]
This ensures compatibility with Excel/CSV exports or further joins.
Que 8. Explain the difference between pivot() and pivot_table().
Answer:
pivot() fails if there are duplicate entries for the index/column combination.
pivot_table() can aggregate those duplicates using functions like mean(), sum(), etc.
Use pivot_table() for production-grade analytics where duplicate groups may exist.
Que 9. How would you optimize Pandas for processing high-frequency financial tick data?
Answer:
The main optimization techniques are:
- Use proper datetime64 indexing
- Store data in HDF5/Parquet for better I/O
- Use rolling window operations for speed
- Minimize type casting and column expansion
Que 10. How do you compare two large DataFrames for differences row by row?
Answer:
Use:
df1.compare(df2)
Or boolean mask:
df1 != df2
For performance on large sets, align DataFrames and convert to NumPy arrays.
Que 11. How do you detect and remove duplicate rows based on specific columns?
Answer:
df.drop_duplicates(subset=['col1', 'col2'], keep='first')
Always inspect with df.duplicated() before dropping to avoid data loss.
Que 12. How do you handle categorical data in Pandas for ML pipelines?
Answer:
Convert categories using:
astype('category') for memory efficiency.
pd.get_dummies() or LabelEncoder for ML-ready formats.
For large datasets, prefer the category dtype to reduce RAM usage significantly.
Que 13. Explain the role of eval() and query() in improving Pandas performance.
Answer:
Both use the numexpr engine internally, allowing faster execution of expressions compared to native Python loops.
df.eval('new_col = col1 + col2')
df.query('col1 > 10 and col2 < 50')
Que 14. How do you extract rows with top 3 highest values for each group?
Answer:
df.groupby('group').apply(lambda x: x.nlargest(3, 'value'))
Or use transform() to rank and filter accordingly.
Que 15. How do you deal with timezone-aware datetime columns?
Answer:
Ensure a consistent timezone using tz_convert() or tz_localize():
df['timestamp'] = df['timestamp'].dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
This avoids errors in comparisons, joins, or plotting.
Que 16. How do you create a new column in a Pandas DataFrame based on existing columns?
Answer:
Create a new column by assigning values derived from existing columns using arithmetic operations, functions, or apply(). For efficiency, prefer vectorized operations over apply().
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['C'] = df['A'] + df['B'] # Vectorized addition
# Or using apply
df['D'] = df.apply(lambda row: row['A'] * row['B'], axis=1)
Que 17. What is the purpose of the merge() function in Pandas?
Answer:
The merge() function combines two DataFrames based on common columns or indices, supporting join types like inner, outer, left, or right. Specify keys with on, left_on, or right_on.
df1 = pd.DataFrame({'key': ['A', 'B'], 'value1': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'C'], 'value2': [3, 4]})
result = pd.merge(df1, df2, on='key', how='inner') # Merges on 'key'
Que 18. How do you filter rows in a Pandas DataFrame based on multiple conditions?
Answer:
Use boolean indexing with & (and), | (or), and parentheses to combine conditions. Apply the filter to the DataFrame to select matching rows.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
filtered = df[(df['A'] > 1) & (df['B'] < 6)] # Rows where A > 1 and B < 6
Que 19. What is the value_counts() method in Pandas, and how is it used?
Answer:
value_counts() counts unique values in a Series, returning a Series with counts sorted in descending order. It’s useful for analyzing frequency distributions.
df = pd.DataFrame({'A': ['x', 'y', 'x', 'z']})
counts = df['A'].value_counts() # Returns: x: 2, y: 1, z: 1
Que 20. How do you sort a Pandas DataFrame by one or more columns?
Answer:
Use sort_values() to sort by specified columns, with ascending=True or False. Multiple columns can be sorted hierarchically by passing a list.
df = pd.DataFrame({'A': [3, 1, 2], 'B': [6, 4, 5]})
sorted_df = df.sort_values(by=['A', 'B'], ascending=[True, False])
Que 21. How do you handle duplicate columns in a Pandas DataFrame?
Answer:
Identify duplicate columns with df.columns.duplicated(). Drop duplicates using df.loc[:, ~df.columns.duplicated()] or rename them to avoid conflicts during analysis.
df = pd.DataFrame([[1, 2, 3]], columns=['A', 'A', 'B'])
df = df.loc[:, ~df.columns.duplicated()] # Keeps unique columns
Que 22. What is the describe() method in Pandas, and what does it provide?
Answer:
describe() generates descriptive statistics for numeric columns, including count, mean, std, min, max, and quartiles. Use include='all' for non-numeric columns.
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
stats = df.describe() # Statistics for numeric column A
Que 23. How do you concatenate multiple DataFrames in Pandas?
Answer:
Use pd.concat() to combine DataFrames vertically (axis=0) or horizontally (axis=1). Specify ignore_index=True for vertical concatenation to reset the indices.
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
result = pd.concat([df1, df2], axis=0, ignore_index=True)
Que 24. How do you rename columns in a Pandas DataFrame?
Answer:
Use rename() with a dictionary mapping old to new column names, or assign a new list to df.columns. Use inplace=True to modify the DataFrame directly.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.rename(columns={'A': 'X', 'B': 'Y'}, inplace=True)
Que 25. How do you convert a Pandas DataFrame to a CSV file?
Answer:
Use to_csv() to export a DataFrame to a CSV file, specifying the file path and optional parameters like index=False to exclude the index.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_csv('output.csv', index=False)
Que 26. What is the shift() method in Pandas, and how is it used?
Answer:
shift() moves data in a Series or DataFrame by a specified number of periods, useful for time series analysis (e.g., calculating differences). Positive values shift down, negative values shift up.
df = pd.DataFrame({'A': [1, 2, 3]})
df['A_shifted'] = df['A'].shift(1) # Shifts A down by 1
Que 27. How do you apply a custom function to a Pandas DataFrame group?
Answer:
Use groupby() followed by apply() or agg() to apply a custom function to each group. For complex operations, apply() is more flexible but slower than vectorized methods.
df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': [1, 2, 3]})
result = df.groupby('A')['B'].apply(lambda x: x.sum())
Que 28. How do you handle multi-level indexing in a Pandas DataFrame?
Answer:
Create multi-level indices with set_index() or pivot_table(). Access data using loc with tuples or xs() for cross-sections. Reset indices with reset_index() for flattening.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': ['x', 'y']})
df.set_index(['C', 'A'], inplace=True)
value = df.loc[('x', 1), 'B'] # Access specific value
Que 29. How do you perform a left join in Pandas?
Answer:
Use merge() with how='left' to keep all rows from the left DataFrame, matching rows from the right. Non-matching rows from the right are filled with NaN.
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'C'], 'data': [3, 4]})
result = pd.merge(df1, df2, on='key', how='left')
Que 30. How do you drop rows or columns in a Pandas DataFrame?
Answer:
Use drop() with axis=0 for rows or axis=1 for columns, specifying labels or indices. Use inplace=True to modify the DataFrame directly.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.drop('B', axis=1, inplace=True) # Drops column B
df.drop(0, axis=0, inplace=True) # Drops row at index 0
Que 31. How do you handle string operations in a Pandas DataFrame?
Answer:
Use the .str accessor on a Series to perform string operations like lower(), replace(), or contains(). These are vectorized for efficiency.
df = pd.DataFrame({'A': ['Foo', 'Bar']})
df['A_lower'] = df['A'].str.lower() # Converts to lowercase
Que 32. What is the resample() method in Pandas, and how is it used?
Answer:
resample() aggregates time series data over specified intervals (e.g., daily, monthly). Set a datetime index, then apply aggregations like mean() or sum().
df = pd.DataFrame({'value': [1, 2, 3]}, index=pd.date_range('2023-01-01', periods=3, freq='H'))
resampled = df.resample('D').sum() # Aggregates hourly to daily
Que 33. How do you convert a Pandas Series to a NumPy array?
Answer:
Use the .to_numpy() method, or the older .values attribute (now discouraged in favor of to_numpy()), to convert a Series to a NumPy array, useful for numerical computations or compatibility with other libraries.
s = pd.Series([1, 2, 3])
array = s.to_numpy() # Returns NumPy array [1, 2, 3]
Que 34. How do you handle large datasets in Pandas with memory constraints?
Answer:
Use chunksize in read_csv() to process data in chunks, or use dask.dataframe for out-of-memory computation. Select only needed columns, use appropriate dtypes (e.g., float32), and leverage to_parquet() for efficient storage.
for chunk in pd.read_csv('large.csv', chunksize=10000):
process(chunk)
Que 35. How do you create a pivot table in Pandas?
Answer:
Use pivot_table() to summarize data, specifying index, columns, values, and aggregation functions (e.g., mean, sum). Handle duplicates with aggfunc.
df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['p', 'q', 'p'], 'C': [1, 2, 3]})
pivot = df.pivot_table(values='C', index='A', columns='B', aggfunc='sum')
Que 36. How do you calculate cumulative sums in a Pandas DataFrame?
Answer:
Use cumsum() on a Series or DataFrame to compute cumulative sums along an axis, useful for running totals in time series or financial data.
df = pd.DataFrame({'A': [1, 2, 3]})
df['cumsum'] = df['A'].cumsum() # Returns [1, 3, 6]
Que 37. How do you apply conditional formatting to a Pandas DataFrame for visualization?
Answer:
Use style.apply() or style.applymap() to apply formatting (e.g., background colors) based on conditions. Export to HTML or Excel for visualization.
df = pd.DataFrame({'A': [1, 2, 3]})
styled = df.style.apply(lambda x: ['background: yellow' if v > 2 else '' for v in x])
Que 38. How do you handle missing values in a Pandas DataFrame?
Answer:
Identify missing values with isna() or isnull(). Fill with fillna() (e.g., mean, forward-fill) or drop with dropna() based on analysis needs.
df = pd.DataFrame({'A': [1, None, 3]})
df['A'] = df['A'].fillna(df['A'].mean()) # Fills with mean
Que 39. How do you split a DataFrame into multiple smaller DataFrames?
Answer:
Use np.array_split() or slice based on indices. For grouped splits, use groupby() and iterate over groups to create separate DataFrames.
import numpy as np
df = pd.DataFrame({'A': range(10)})
split_dfs = np.array_split(df, 2) # Splits into two DataFrames
Que 40. How do you perform a cross-tabulation in Pandas?
Answer:
Use pd.crosstab() to compute a frequency table of two or more factors, useful for analyzing categorical data relationships.
df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['p', 'q', 'p']})
cross_tab = pd.crosstab(df['A'], df['B'])

Also Check: Python Interview Questions and Answers
NumPy Interview Questions and Answers
Que 41. How does NumPy handle memory efficiency compared to Python lists?
Answer:
NumPy uses contiguous memory blocks and a fixed data type, which allows it to be more memory-efficient and faster than Python lists that store pointers to objects.
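As a rough illustration (exact sizes vary by platform; note that sys.getsizeof() reports only the list object itself, not the integer objects it points to):
import sys
import numpy as np
py_list = list(range(1000))
np_arr = np.arange(1000, dtype=np.int64)
print(sys.getsizeof(py_list))  # size of the list object and its pointers only
print(np_arr.nbytes)           # 8000 bytes: 1000 contiguous int64 values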
Que 42. What are broadcasting rules in NumPy?
Answer:
Broadcasting allows NumPy to perform arithmetic operations on arrays of different shapes. It follows a set of rules to stretch the smaller array across the larger one without copying data.
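A small sketch of broadcasting in action:
import numpy as np
a = np.ones((3, 1))  # shape (3, 1)
b = np.arange(3)     # shape (3,)
result = a + b       # shapes broadcast to (3, 3) without copying data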
Que 43. How would you avoid unnecessary memory duplication while slicing arrays?
Answer:
By default, slicing in NumPy returns a view, not a copy. To avoid duplication, continue using views unless a copy is explicitly required using .copy().
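For instance:
import numpy as np
arr = np.arange(10)
view = arr[2:5]                     # a view: no data copied
print(np.shares_memory(arr, view))  # True
snapshot = arr[2:5].copy()          # explicit, independent copy when needed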
Que 44. What is the difference between np.array_equal() and np.allclose()?
Answer:
np.array_equal() checks for exact equality, including shape and values, while np.allclose() checks whether values are approximately equal within a given tolerance, useful for floating-point comparisons.
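A quick illustration with floating-point values:
import numpy as np
a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])
print(np.array_equal(a, b))  # False: 0.1 + 0.2 != 0.3 exactly
print(np.allclose(a, b))     # True: equal within the default tolerance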
Que 45. How do you improve the performance of mathematical operations on large NumPy arrays?
Answer:
Use vectorized operations, avoid explicit loops, leverage broadcasting, and use in-place operations where possible to reduce memory overhead and improve speed.
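For example, replacing a Python loop with a single vectorized expression:
import numpy as np
arr = np.random.rand(1_000_000)
result = arr * 2 + 1  # one vectorized pass instead of a million-step loop
arr += 1              # in-place operation: no intermediate array allocated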
Que 46. Explain the use of np.where() in NumPy.
Answer:
np.where() is used for conditional filtering and element selection. It returns indices or elements based on a condition, useful for constructing conditional arrays.
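For instance:
import numpy as np
arr = np.array([1, 5, 2, 8])
replaced = np.where(arr > 4, arr, 0)  # [0, 5, 0, 8]
indices = np.where(arr > 4)           # (array([1, 3]),) when no values given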
Que 47. How would you flatten a multi-dimensional array, and what are the trade-offs?
Answer:
Use ravel() (returns a view if possible) or flatten() (always returns a copy). Choose ravel() when performance matters and copying isn’t needed.
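A short sketch showing the view/copy difference:
import numpy as np
arr = np.array([[1, 2], [3, 4]])
r = arr.ravel()                  # a view for a contiguous array
f = arr.flatten()                # always an independent copy
print(np.shares_memory(arr, r))  # True
print(np.shares_memory(arr, f))  # False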
Que 48. What’s the use of np.meshgrid() in numerical computations?
Answer:
np.meshgrid() creates coordinate matrices from coordinate vectors. It’s essential for evaluating functions over a grid in 2D/3D plotting and numerical simulations.
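For example, evaluating f(x, y) = x^2 + y^2 over a grid:
import numpy as np
x = np.linspace(-1, 1, 3)
y = np.linspace(-1, 1, 3)
X, Y = np.meshgrid(x, y)  # two (3, 3) coordinate matrices
Z = X**2 + Y**2           # the function evaluated over the whole grid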
Que 49. Describe the impact of dtype in NumPy operations.
Answer:
dtype defines the type of data (int, float, etc.) and its precision. Choosing an appropriate dtype improves memory usage and computation speed.
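For instance, halving memory by switching precision:
import numpy as np
a64 = np.ones(1_000_000, dtype=np.float64)
a32 = a64.astype(np.float32)
print(a64.nbytes, a32.nbytes)  # 8000000 vs 4000000 bytes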
Que 50. How does NumPy handle missing or NaN values?
Answer:
NumPy supports NaN via np.nan, but operations involving NaNs generally return NaN. Use functions like np.isnan(), np.nanmean(), or mask invalid entries to handle them.
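For example:
import numpy as np
arr = np.array([1.0, np.nan, 3.0])
print(np.mean(arr))          # nan: NaN propagates through the calculation
print(np.nanmean(arr))       # 2.0: ignores the NaN
valid = arr[~np.isnan(arr)]  # mask out invalid entries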
Que 51. What are structured arrays in NumPy?
Answer:
Structured arrays allow different data types in one array (like a table). They’re useful for datasets that contain mixed-type records similar to a database table.
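A minimal sketch (field names are illustrative):
import numpy as np
people = np.array([('Alice', 30), ('Bob', 25)],
                  dtype=[('name', 'U10'), ('age', 'i4')])
print(people['age'])                       # [30 25]
print(people[people['age'] > 26]['name'])  # ['Alice']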
Que 52. Explain how you would optimize matrix operations in NumPy.
Answer:
Use @ or np.dot() for matrix multiplication, prefer in-place operations, use BLAS/LAPACK-accelerated libraries, and avoid reshaping arrays unnecessarily.
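For instance:
import numpy as np
A = np.random.rand(500, 500)
B = np.random.rand(500, 500)
C = A @ B  # dispatches to the BLAS routine NumPy is linked against
A += 1.0   # in-place update: no extra 500x500 temporary allocated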
Que 53. How does NumPy support linear algebra operations?
Answer:
NumPy provides the np.linalg module with functions like inv(), eig(), svd(), and solve() for matrix inversion, eigenvalues, singular value decomposition, and solving linear systems.
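For example, solving a linear system directly (generally preferred over computing inv(A) @ b):
import numpy as np
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)  # [2. 3.]: solves Ax = b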
Que 54. What is memory layout in NumPy (C vs Fortran order), and why does it matter?
Answer:
C-order is row-major, Fortran-order is column-major. It impacts performance in matrix operations. Choose the right layout to optimize cache usage and computation speed.
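A quick way to inspect the layout:
import numpy as np
c_arr = np.ones((1000, 1000), order='C')  # row-major (the default)
f_arr = np.ones((1000, 1000), order='F')  # column-major
print(c_arr.flags['C_CONTIGUOUS'])  # True
print(f_arr.flags['F_CONTIGUOUS'])  # True
# Iterating along rows is cache-friendly in C order, along columns in F order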
Que 55. How would you handle large datasets that don’t fit in memory using NumPy?
Answer:
Use memory-mapped files (np.memmap) to read large binary data in chunks, enabling efficient processing without loading everything into RAM.
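A minimal sketch, assuming 'big.dat' is an existing binary file of float32 values (the file name and shape here are hypothetical):
import numpy as np
mm = np.memmap('big.dat', dtype=np.float32, mode='r', shape=(10_000_000,))
chunk_mean = mm[:1_000_000].mean()  # only the touched pages are read from disk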
Que 56. How do you create a NumPy array with a specific data type?
Answer:
Use the dtype parameter when creating a NumPy array with functions like np.array(), np.zeros(), or np.ones(). Specify types like int32, float64, or bool.
import numpy as np
arr = np.array([1, 2, 3], dtype=np.float64) # Creates array with float64 type
Que 57. What is the purpose of np.zeros_like() and np.ones_like() in NumPy?
Answer:
np.zeros_like() and np.ones_like() create arrays of zeros or ones with the same shape and data type as an input array, useful for initializing compatible arrays.
arr = np.array([[1, 2], [3, 4]])
zeros = np.zeros_like(arr) # Array of zeros with same shape and dtype
Que 58. How do you perform element-wise comparison in NumPy?
Answer:
Use comparison operators (e.g., >, ==) to perform element-wise comparisons, returning a boolean array. Combine with np.where() for conditional operations.
arr = np.array([1, 2, 3])
result = arr > 2 # Returns [False, False, True]
Que 59. What is the np.concatenate() function, and how is it used?
Answer:
np.concatenate() joins multiple arrays along a specified axis. It requires arrays to have compatible shapes except along the concatenation axis.
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6]])
result = np.concatenate((arr1, arr2), axis=0) # Vertical concatenation
Que 60. How do you compute the dot product of two arrays in NumPy?
Answer:
Use np.dot() or the @ operator for matrix multiplication or the dot product of vectors. For 2D arrays, it performs matrix multiplication; for 1D, it computes the dot product.
a = np.array([1, 2])
b = np.array([3, 4])
dot_product = np.dot(a, b) # Returns 11 (1*3 + 2*4)
Que 61. What are the np.vstack() and np.hstack() functions in NumPy?
Answer:
np.vstack() stacks arrays vertically (row-wise), and np.hstack() stacks them horizontally (column-wise). Both require compatible shapes along the non-stacking axes.
a = np.array([1, 2])
b = np.array([3, 4])
vstacked = np.vstack((a, b)) # Stacks as [[1, 2], [3, 4]]
hstacked = np.hstack((a, b)) # Stacks as [1, 2, 3, 4]
Que 62. How do you generate random numbers in NumPy?
Answer:
Use np.random module functions like np.random.rand() for a uniform distribution or np.random.randn() for a standard normal distribution. Set a seed with np.random.seed() for reproducibility.
np.random.seed(42)
rand_nums = np.random.rand(3) # Random numbers between 0 and 1
Que 63. What is the purpose of np.linalg.inv() in NumPy?
Answer:
np.linalg.inv() computes the inverse of a square matrix, useful for solving linear systems or matrix operations. The input matrix must be non-singular.
matrix = np.array([[1, 2], [3, 4]])
inverse = np.linalg.inv(matrix) # Computes inverse matrix
Que 64. How do you reshape a NumPy array without changing its data?
Answer:
Use reshape() to change the array’s shape while preserving data, ensuring the total number of elements remains the same. Use -1 to infer one dimension.
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped = arr.reshape(2, 3) # Reshapes to 2x3 array
Que 65. What is np.transpose() and how is it used?
Answer:
np.transpose() swaps the axes of an array, effectively transposing it. For 2D arrays, it switches rows and columns. Specify axes for higher-dimensional arrays.
arr = np.array([[1, 2], [3, 4]])
transposed = np.transpose(arr) # Returns [[1, 3], [2, 4]]
Que 66. How do you compute the cumulative product of a NumPy array?
Answer:
Use np.cumprod() to calculate the cumulative product along a specified axis, useful for sequential multiplication in data analysis.
arr = np.array([1, 2, 3, 4])
cumprod = np.cumprod(arr) # Returns [1, 2, 6, 24]
Que 67. What is the purpose of np.unique() in NumPy?
Answer:
np.unique() returns the unique elements of an array in sorted order. Use return_counts=True to get the frequency of each unique value.
arr = np.array([1, 2, 2, 3])
unique = np.unique(arr) # Returns [1, 2, 3]
Que 68. How do you perform matrix multiplication in NumPy?
Answer:
Use np.matmul() or the @ operator for matrix multiplication. For 2D arrays, it follows standard matrix rules; for higher dimensions, it operates on the last two axes.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
result = np.matmul(a, b) # Matrix multiplication
Que 69. How do you handle broadcasting errors in NumPy?
Answer:
Broadcasting errors occur when array shapes are incompatible. Reshape arrays with np.reshape() or add dimensions with np.expand_dims() to align shapes, ensuring compatibility for element-wise operations.
a = np.array([1, 2, 3])
b = np.array([[4], [5], [6]])
result = a + b # Broadcasting aligns shapes
Que 70. What are the np.argmax() and np.argmin() functions in NumPy?
Answer:
np.argmax() returns the indices of maximum values, and np.argmin() returns the indices of minimum values, along an axis or over the flattened array.
arr = np.array([3, 1, 4, 2])
max_idx = np.argmax(arr) # Returns 2 (index of 4)
Que 71. How do you save and load NumPy arrays to/from disk?
Answer:
Use np.save() to save arrays in .npy format and np.load() to load them. For multiple arrays, use np.savez() to create an .npz archive.
arr = np.array([1, 2, 3])
np.save('array.npy', arr)
loaded = np.load('array.npy')
Que 72. What is the purpose of np.clip() in NumPy?
Answer:
np.clip() limits array values to a specified range, setting values below or above the bounds to the minimum or maximum. Useful for data normalization.
arr = np.array([1, 5, 10])
clipped = np.clip(arr, 2, 8) # Returns [2, 5, 8]
Que 73. How do you compute the standard deviation of a NumPy array?
Answer:
Use np.std() to calculate the standard deviation along a specified axis or over the entire array, with optional degrees of freedom (ddof).
arr = np.array([1, 2, 3, 4])
std_dev = np.std(arr) # Returns standard deviation
Que 74. What is np.linspace() and how does it differ from np.arange()?
Answer:
np.linspace() creates an array of evenly spaced numbers over a specified interval, given the number of points, while np.arange() uses a step size. linspace is better for fixed-length sequences, arange for step-based ones.
linspace = np.linspace(0, 10, 5) # Returns [0, 2.5, 5, 7.5, 10]
arange = np.arange(0, 10, 2) # Returns [0, 2, 4, 6, 8]
Que 75. How do you perform element-wise exponentiation in NumPy?
Answer:
Use np.power() or the ** operator for element-wise exponentiation, raising each element to a specified power.
arr = np.array([1, 2, 3])
powered = np.power(arr, 2) # Returns [1, 4, 9]
Que 76. What is the purpose of np.any() and np.all() in NumPy?
Answer:
np.any() checks whether any element in an array evaluates to True, while np.all() checks whether all elements are True. Useful for boolean operations along an axis.
arr = np.array([True, False, True])
any_true = np.any(arr) # Returns True
all_true = np.all(arr) # Returns False
Que 77. How do you split a NumPy array into multiple sub-arrays?
Answer:
Use np.split(), np.vsplit(), or np.hsplit() to split arrays along an axis into equal or specified-sized sub-arrays.
arr = np.array([1, 2, 3, 4])
sub_arrays = np.split(arr, 2) # Splits into [[1, 2], [3, 4]]
Que 78. What is np.diagonal() used for in NumPy?
Answer:
np.diagonal() extracts the diagonal elements of a 2D (or higher) array, with an optional offset to select off-diagonal elements.
arr = np.array([[1, 2], [3, 4]])
diagonal = np.diagonal(arr) # Returns [1, 4]
Que 79. How do you compute the cross product of two vectors in NumPy?
Answer:
Use np.cross() to compute the cross product of two 1D or 2D arrays representing vectors, returning a vector perpendicular to both inputs.
a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
cross = np.cross(a, b) # Returns [0, 0, 1]
Que 80. How do you handle array broadcasting with different shapes in NumPy?
Answer:
Broadcasting aligns arrays by stretching dimensions of size 1 to match larger shapes. Ensure compatibility by reshaping or adding dimensions with np.expand_dims() to avoid shape mismatch errors.
a = np.array([1, 2, 3])
b = np.array([[4], [5]])
result = a + b # Broadcasts to [[5, 6, 7], [6, 7, 8]]

Pandas and NumPy Interview Questions and Answers for Data Analyst
Que 81. How would you handle missing data in a Pandas DataFrame?
Answer:
You can handle missing data using dropna() to remove rows/columns with null values, fillna() to replace them with a value like the mean or median, or interpolate() for continuous data.
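For example:
import pandas as pd
df = pd.DataFrame({'temp': [21.0, None, 23.5, None]})
dropped = df.dropna()                  # remove rows with nulls
filled = df.fillna(df['temp'].mean())  # replace nulls with the mean
smoothed = df.interpolate()            # linear interpolation for continuous data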
Que 82. What is the difference between .loc[], .iloc[], and .at[] in Pandas?
Answer:
.loc[] is label-based indexing, .iloc[] is integer-based indexing, and .at[] is optimized for accessing a single scalar value using label-based indexing, offering better performance than .loc[].
Que 83. How would you join two large DataFrames in Pandas efficiently?
Answer:
Use merge() with appropriate on and how parameters, and set indexes or sort keys beforehand. For very large datasets, consider chunking or using categorical data types to optimize memory.
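A small sketch of an index-based join (column names are illustrative):
import pandas as pd
df1 = pd.DataFrame({'customer_id': [1, 2, 3], 'spend': [10, 20, 30]})
df2 = pd.DataFrame({'customer_id': [2, 3, 4], 'region': ['N', 'S', 'E']})
left = df1.set_index('customer_id').sort_index()
right = df2.set_index('customer_id').sort_index()
joined = left.join(right, how='inner')  # joining on sorted indexes is fast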
Que 84. How do you identify and remove duplicate rows in a DataFrame?
Answer:
Use df.duplicated() to identify duplicates and df.drop_duplicates() to remove them. You can also specify columns with the subset parameter and control which duplicate to keep using keep.
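For instance:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 2], 'val': ['a', 'a', 'b']})
print(df.duplicated())  # flags the second row as a duplicate
deduped = df.drop_duplicates(subset=['id'], keep='first')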
Que 85. What are the key differences between NumPy arrays and Pandas Series?
Answer:
NumPy arrays are homogeneous and indexed by position, while Pandas Series are one-dimensional labeled arrays, allowing mixed data types and better alignment with real-world datasets.
Que 86. How do you calculate rolling statistics in Pandas?
Answer:
Use the rolling() method followed by an aggregation function. For example, df['sales'].rolling(window=7).mean() computes a 7-day moving average, useful for trend analysis.
Que 87. When would you use groupby() in Pandas?
Answer:
Use groupby() to split data into groups, apply aggregation functions (like sum(), mean()), and combine the results. It’s powerful for summarizing and transforming data by category.
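For example:
import pandas as pd
df = pd.DataFrame({'dept': ['A', 'A', 'B'], 'salary': [50, 60, 70]})
summary = df.groupby('dept')['salary'].agg(['mean', 'sum'])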
Que 88. How can you reshape data using NumPy and Pandas?
Answer:
Use NumPy’s reshape() and ravel() for multi-dimensional arrays. In Pandas, use pivot(), melt(), and stack()/unstack() to reshape data between wide and long formats.
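For instance, going from wide to long and back (column names are illustrative):
import pandas as pd
wide = pd.DataFrame({'id': [1, 2], 'q1': [10, 20], 'q2': [30, 40]})
long = wide.melt(id_vars='id', var_name='quarter', value_name='sales')
back = long.pivot(index='id', columns='quarter', values='sales')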
Que 89. What’s the difference between np.mean() and np.average()?
Answer:
np.mean() computes the arithmetic mean, while np.average() allows weighting the values via a weights parameter. Use np.average() when values have different importance.
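For example:
import numpy as np
values = np.array([80, 90, 100])
weights = np.array([0.2, 0.3, 0.5])
print(np.mean(values))                      # 90.0
print(np.average(values, weights=weights))  # 93.0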
Que 90. How would you detect and handle outliers using Pandas and NumPy?
Answer:
You can use statistical methods like the IQR (quantile()), Z-scores (scipy.stats.zscore() or the manual formula), or visual methods like boxplots to detect outliers. Handling involves filtering, capping, or transforming them based on context.
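A minimal IQR filter sketch:
import pandas as pd
s = pd.Series([10, 12, 11, 13, 120])  # 120 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
cleaned = s[mask]  # drops 120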
Also Check: Data Analyst Interview Questions and Answers
Pandas and NumPy Interview Questions and Answers for Data engineer
Que 91. How do you handle very large datasets in Pandas without running out of memory?
Answer:
Use chunksize while reading data (e.g., pd.read_csv()), downcast numeric data types to save memory, drop unused columns early, use the category dtype for repeated strings, and consider Dask or Vaex for distributed computation.
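For example, downcasting and categorizing to cut memory:
import pandas as pd
df = pd.DataFrame({'price': [1.5, 2.5], 'city': ['NY', 'NY']})
df['price'] = pd.to_numeric(df['price'], downcast='float')  # float64 -> float32
df['city'] = df['city'].astype('category')  # stores each unique string once
print(df.memory_usage(deep=True))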
Que 92. How would you perform an efficient join of two large datasets using Pandas?
Answer:
Set appropriate indexes or keys before joining, reduce the memory footprint by selecting only needed columns, and use merge() with sorted keys or hashed values. Also, consider merging in chunks if the datasets are huge.
Que 93. Explain the difference between shallow copy and deep copy in NumPy and Pandas.
Answer:
A shallow copy creates a new object but refers to the same data (a view in NumPy), while a deep copy creates a new object with a completely new copy of the data (the copy() method in both libraries).
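For instance, in NumPy:
import numpy as np
arr = np.arange(5)
shallow = arr.view()  # shares the same underlying buffer
deep = arr.copy()     # fully independent data
arr[0] = 99
print(shallow[0], deep[0])  # 99 0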
Que 94. How do you apply transformations efficiently to entire columns in Pandas?
Answer:
Use vectorized operations like df['col'] = df['col'] * 2 instead of row-wise loops. You can also use apply(), map(), or transform() for custom logic, but prefer native vectorized methods for speed.
Que 95. How would you optimize a DataFrame with 100+ million rows?
Answer:
Use dtype optimization (e.g., convert float64 to float32), use category types for repeated strings, drop null-heavy columns, apply filters before joins, and avoid loading all data at once; use generators or Dask.
Que 96. How can you write and read compressed DataFrames in Pandas?
Answer:
Use the compression parameter in read_csv() and to_csv(). For example:
df.to_csv('data.csv.gz', compression='gzip')
pd.read_csv('data.csv.gz', compression='gzip')
Que 97. How do you use NumPy broadcasting in matrix computations?
Answer:
Broadcasting allows arithmetic operations between arrays of different shapes. For example:
a = np.array([1, 2, 3])
b = np.array([[1], [2], [3]])
result = a + b # shape becomes (3, 3)
It’s memory efficient and avoids explicit looping.
Que 98. How can you handle datetime manipulation in large Pandas datasets?
Answer:
Use pd.to_datetime() to parse dates, then access components like .dt.year, .dt.month, etc. Pass parse_dates when reading CSVs. Use .resample() for time series aggregation and asfreq() for frequency conversion.
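For example:
import pandas as pd
df = pd.DataFrame({'ts': ['2023-01-01 10:00', '2023-01-02 11:00'],
                   'value': [1, 2]})
df['ts'] = pd.to_datetime(df['ts'])
df['year'] = df['ts'].dt.year
daily = df.set_index('ts').resample('D')['value'].sum()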
Que 99. How do you detect data type mismatches in a DataFrame?
Answer:
Use df.dtypes to inspect column types. Combine with applymap() or pd.to_numeric(errors='coerce') to catch or convert incompatible types. Type mismatches can silently break joins, merges, and aggregations.
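For instance:
import pandas as pd
df = pd.DataFrame({'amount': ['10', '20', 'bad']})
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')  # 'bad' becomes NaN
print(df.dtypes)                  # amount is now float64
print(df['amount'].isna().sum())  # 1 mismatch caught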
Que 100. What’s the role of np.where() in conditional operations?
Answer:
np.where() is used for element-wise conditional logic:
np.where(condition, value_if_true, value_if_false)
It is much faster than using for loops or apply() in Pandas for large arrays.
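For example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'score': [40, 75, 90]})
df['grade'] = np.where(df['score'] >= 60, 'pass', 'fail')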
Also Check: Data Engineer Interview Questions and Answers
Pandas and NumPy Interview Questions PDF
We have compiled all the Pandas and NumPy interview questions and answers in a PDF. It’s easy to read, covers all the key topics, and is perfect for last-minute revision before your interview.
FAQs: Pandas and NumPy Interview Questions
What is the role of Pandas and NumPy?
Pandas and NumPy are essential Python libraries for data manipulation and numerical computing. NumPy provides support for fast mathematical operations on arrays, while Pandas offers powerful data structures like Series and DataFrames, making it easier to analyze, clean, and organize large datasets.
What challenges do candidates face during Pandas and NumPy interviews?
Candidates often struggle with applying vectorized operations, understanding complex indexing, and solving real-world data manipulation problems. Interviewers usually test hands-on knowledge, so a lack of practice with real datasets and functions like groupby, merge, and slicing can make it tough.
What are common job challenges for professionals using Pandas and NumPy?
Professionals using these libraries often deal with messy or incomplete data, memory management for large datasets, and performance optimization. Understanding how to write efficient, readable code and debug data transformations is key to using Pandas and NumPy effectively.
How important is it to know both Pandas and NumPy for data roles?
Knowing both is very important for data roles, especially for data analysts, data scientists, and machine learning engineers. NumPy provides the foundation for numerical operations, while Pandas builds on top of it to make data analysis more intuitive and structured.
What is the average salary for professionals skilled in Pandas and NumPy in the USA?
Professionals with strong Pandas and NumPy skills, especially when combined with broader Python and data analysis knowledge, can earn between $80,000 and $120,000 annually in the USA. Senior data scientists and engineers can earn over $140,000 depending on experience and industry.
Which top companies hire professionals with Pandas and NumPy skills?
Top companies such as Google, Microsoft, Amazon, Netflix, Meta, IBM, and financial firms like JPMorgan and Goldman Sachs actively hire professionals with expertise in Pandas and NumPy for roles in analytics, AI, and data engineering.
Why is interview preparation important for Pandas and NumPy roles?
Preparation is crucial because interviews often include practical coding exercises, data cleaning tasks, and logic-based problem-solving using these libraries. Practicing real-world scenarios and mastering the most-used functions improves both speed and accuracy in interviews.