SQL (Structured Query Language) is a programming language used to manage and manipulate relational databases. It is important for data analytics because it allows analysts to extract, manipulate, and analyze data stored in relational databases.
You can retrieve data using the SELECT statement.
Example: SELECT * FROM employees;
A JOIN is used to combine rows from two or more tables based on a related column. Types include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.
Example: SELECT a.name, b.salary FROM employees a INNER JOIN salaries b ON a.id = b.employee_id;
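To make the difference between join types concrete, here is a minimal sketch that runs the query above, plus a LEFT JOIN variant, against an in-memory SQLite database using Python's built-in sqlite3 module. The sample rows are invented for illustration; the table and column names follow the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE salaries (employee_id INTEGER, salary INTEGER);
    INSERT INTO employees VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO salaries VALUES (1, 70000), (2, 45000);
""")

# INNER JOIN keeps only employees that have a matching salary row.
inner_rows = conn.execute("""
    SELECT a.name, b.salary
    FROM employees a
    INNER JOIN salaries b ON a.id = b.employee_id
    ORDER BY a.id
""").fetchall()

# LEFT JOIN keeps every employee; a missing salary comes back as NULL (None).
left_rows = conn.execute("""
    SELECT a.name, b.salary
    FROM employees a
    LEFT JOIN salaries b ON a.id = b.employee_id
    ORDER BY a.id
""").fetchall()

print(inner_rows)  # [('Ann', 70000), ('Bob', 45000)]
print(left_rows)   # [('Ann', 70000), ('Bob', 45000), ('Cid', None)]
conn.close()
```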
WHERE is used to filter records before any groupings are made, while HAVING is used to filter records after grouping.
Example: SELECT department, COUNT(*) FROM employees WHERE age > 30 GROUP BY department HAVING COUNT(*) > 10;
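The order of filtering can be checked with a small sketch using Python's sqlite3 module. The rows are made up, and the HAVING threshold is lowered to 1 (an assumption for this demo) so that a handful of rows is enough to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ann", "Sales", 35),
        ("Bob", "Sales", 41),
        ("Cid", "Sales", 25),  # dropped by WHERE (age is not > 30)
        ("Dee", "IT", 33),
        ("Eli", "IT", 28),     # dropped by WHERE (age is not > 30)
    ],
)

# WHERE filters individual rows first; HAVING then filters whole groups.
rows = conn.execute("""
    SELECT department, COUNT(*)
    FROM employees
    WHERE age > 30
    GROUP BY department
    HAVING COUNT(*) > 1
    ORDER BY department
""").fetchall()

print(rows)  # [('Sales', 2)]
conn.close()
```

IT has two employees in the table, but only one survives the WHERE filter, so its group is removed by HAVING.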
You can find duplicates using the GROUP BY and HAVING clauses.
Example: SELECT name, COUNT(*) FROM employees GROUP BY name HAVING COUNT(*) > 1;
A subquery is a query within another query.
Example: SELECT name FROM employees WHERE id IN (SELECT employee_id FROM salaries WHERE salary > 50000);
You can update data using the UPDATE statement.
Example: UPDATE employees SET salary = 60000 WHERE id = 1;
DELETE removes rows one at a time, logs each deletion, and accepts a WHERE clause; TRUNCATE removes all rows at once without logging individual row deletions and does not accept a WHERE clause.
Example: DELETE FROM employees WHERE id = 1; vs TRUNCATE TABLE employees;
You can create a table using the CREATE TABLE statement.
Example: CREATE TABLE employees (id INT, name VARCHAR(100), age INT);
Indexes are used to speed up the retrieval of rows by creating a data structure that allows quick lookup of values.
Example: CREATE INDEX idx_name ON employees(name);
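One way to see an index being used is to compare the query plan before and after creating it. The sketch below does this in SQLite through Python's sqlite3 module; the generated rows and the `plan` helper are hypothetical, and the exact plan wording varies between SQLite versions, so the code only checks whether the index name appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(i, f"name{i}", 20 + i % 40) for i in range(1000)],
)

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

query = "SELECT * FROM employees WHERE name = 'name500'"

plan_before = plan(query)  # no index yet, so SQLite scans the table
conn.execute("CREATE INDEX idx_name ON employees(name)")
plan_after = plan(query)   # the plan now mentions idx_name

print(plan_before)
print(plan_after)
conn.close()
```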
UNION removes duplicate rows from the combined result; UNION ALL keeps them.
Example: SELECT name FROM employees1 UNION SELECT name FROM employees2;
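A short sketch of the difference, run in SQLite via Python's sqlite3 module. The two tables reuse the names from the example; their contents are invented so that one name ('Bob') appears in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees1 (name TEXT);
    CREATE TABLE employees2 (name TEXT);
    INSERT INTO employees1 VALUES ('Ann'), ('Bob');
    INSERT INTO employees2 VALUES ('Bob'), ('Cid');
""")

# UNION returns each distinct name once.
union_rows = conn.execute(
    "SELECT name FROM employees1 UNION SELECT name FROM employees2 ORDER BY name"
).fetchall()

# UNION ALL simply appends the second result to the first.
union_all_rows = conn.execute(
    "SELECT name FROM employees1 UNION ALL SELECT name FROM employees2 ORDER BY name"
).fetchall()

print(union_rows)      # [('Ann',), ('Bob',), ('Cid',)]
print(union_all_rows)  # [('Ann',), ('Bob',), ('Bob',), ('Cid',)]
conn.close()
```

Because UNION ALL skips the duplicate-removal step, it is usually the cheaper choice when duplicates are impossible or acceptable.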
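Use the IS NULL and IS NOT NULL predicates to test for missing values, and the COALESCE function (standard SQL) or IFNULL (MySQL and SQLite) to substitute a default.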
Example: SELECT COALESCE(name, 'Unknown') FROM employees;
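The sketch below, using Python's sqlite3 module with made-up rows, shows why IS NULL is needed: a comparison with `= NULL` never matches, because NULL is not equal to anything, including itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Ann"), (2, None)])

# IS NULL finds the row with the missing name.
null_ids = conn.execute("SELECT id FROM employees WHERE name IS NULL").fetchall()

# '= NULL' evaluates to unknown for every row, so nothing is returned.
eq_null = conn.execute("SELECT id FROM employees WHERE name = NULL").fetchall()

# COALESCE returns its first non-NULL argument.
names = conn.execute(
    "SELECT COALESCE(name, 'Unknown') FROM employees ORDER BY id"
).fetchall()

print(null_ids)  # [(2,)]
print(eq_null)   # []
print(names)     # [('Ann',), ('Unknown',)]
conn.close()
```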
A primary key is a column, or set of columns, that uniquely identifies each record in a table; its values must be unique and cannot be NULL.
Example: ALTER TABLE employees ADD PRIMARY KEY (id);
Use the LOWER or UPPER functions.
Example: SELECT * FROM employees WHERE LOWER(name) = 'john';
CHAR is fixed-length and pads shorter values with spaces; VARCHAR is variable-length and stores only the characters supplied.
Example: CREATE TABLE employees (name CHAR(50)); vs CREATE TABLE employees (name VARCHAR(50));
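Use the LIMIT clause (MySQL, PostgreSQL, SQLite). SQL Server uses TOP instead, and standard SQL uses FETCH FIRST n ROWS ONLY.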
Example: SELECT * FROM employees ORDER BY salary DESC LIMIT 10;
A stored procedure is a set of SQL statements that can be stored and executed on the database server.
Example (MySQL syntax): CREATE PROCEDURE GetEmployees() BEGIN SELECT * FROM employees; END;
Normalization is the process of organizing data to reduce redundancy and improve data integrity.
Example: Dividing a single table into multiple tables to eliminate duplicate data.
You can use a subquery.
Example: SELECT a.name, (SELECT salary FROM salaries WHERE employee_id = a.id) FROM employees a;
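A foreign key is a column, or set of columns, in one table that references the primary key (or another unique key) of another table, so that every value it holds must match an existing row there.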
Example: ALTER TABLE salaries ADD FOREIGN KEY (employee_id) REFERENCES employees(id);
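As a rough illustration of what the constraint enforces, the sketch below uses Python's sqlite3 module. Two assumptions are specific to SQLite: foreign keys are only enforced after `PRAGMA foreign_keys = ON`, and the constraint is declared inline in CREATE TABLE because SQLite does not support adding it with ALTER TABLE. The rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable enforcement
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE salaries (
        employee_id INTEGER REFERENCES employees(id),
        salary INTEGER
    );
    INSERT INTO employees VALUES (1, 'Ann');
""")

# Employee 1 exists, so this row is accepted.
conn.execute("INSERT INTO salaries VALUES (1, 70000)")

# Employee 99 does not exist, so the database rejects the row.
try:
    conn.execute("INSERT INTO salaries VALUES (99, 50000)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

salary_count = conn.execute("SELECT COUNT(*) FROM salaries").fetchone()[0]
print(rejected)      # True
print(salary_count)  # 1
conn.close()
```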
Power BI is a business analytics tool by Microsoft that provides interactive visualizations and business intelligence capabilities with an interface simple enough for end users to create their own reports and dashboards.
The main components are Power BI Desktop, Power BI Service, Power BI Mobile, Power BI Gateway, Power BI Report Server, and Power BI Embedded.
Use the Get Data feature in Power BI Desktop to connect to various data sources like SQL Server, Excel, web, etc.
DAX (Data Analysis Expressions) is a formula language used in Power BI, Power Pivot, and Analysis Services for creating custom calculations and aggregations.
Example: SUM(Sales[Amount])
Go to the Modeling tab, select New Column, and enter your DAX formula.
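Example (hypothetical column name and rate): Amount With Tax = Sales[Amount] * 1.1. A calculated column is evaluated row by row, so an aggregate such as SUM(Sales[Amount]) would repeat the grand total on every row.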
A measure is a calculation used in aggregations, often created using DAX.
Example: SUM(Sales[Amount])
Use Power BI Desktop to connect to data, transform it, and use visualization tools to create charts, graphs, and other visual elements.
Dashboards are single-page, often interactive, visual representations of data, created from reports to provide at-a-glance insights.
Publish reports to the Power BI Service and share them with others using sharing links, workspaces, or embedding them in websites or applications.
Power Query is used for data ingestion and transformation, allowing users to extract, transform, and load (ETL) data from various sources.
Set up a data refresh schedule in the Power BI Service to update datasets automatically from the connected data sources.
Slicers are visual tools that allow users to filter data in reports and dashboards interactively.
Example: Adding a date slicer to filter sales data by date.
Use the Model view to drag and drop fields to create relationships, or use the Manage Relationships feature.
Power BI Q&A allows users to ask questions about their data in natural language and get answers in the form of visualizations.
Bookmarks capture the current state of a report page, including filters and visuals, allowing users to save and navigate to specific views.
Themes are predefined sets of colors and formatting that can be applied to reports to ensure consistency and improve visual appeal.
To improve report performance, use techniques such as reducing the volume of data loaded, optimizing DAX queries, using aggregations, and indexing data sources.
Power BI Pro is a per-user license that allows for sharing and collaboration, while Power BI Premium offers dedicated resources and enhanced performance for larger-scale deployments.
Use the R and Python visuals in Power BI Desktop to run scripts and create custom visualizations.
The Power BI Service is a cloud-based platform for sharing, collaborating on, and managing Power BI reports and dashboards.
Python is a high-level programming language known for its simplicity and readability. It is used in data analytics for its powerful libraries and tools for data manipulation, analysis, and visualization.
Use the pandas library.
Example: import pandas as pd; df = pd.read_csv('file.csv')
Commonly used libraries include pandas, NumPy, matplotlib, seaborn, and scikit-learn.
Use pandas functions like dropna() and fillna().
Example: df.dropna() or df.fillna(0)
Use the merge function in pandas.
Example: pd.merge(df1, df2, on='key')
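By default, merge performs an inner join. The sketch below, with two invented DataFrames, shows how the `how` parameter changes which rows survive.

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
df2 = pd.DataFrame({"key": [1, 2, 4], "salary": [70000, 45000, 52000]})

# Default how="inner": only keys present in both frames are kept.
inner = pd.merge(df1, df2, on="key")

# how="left": every row of df1 is kept; unmatched salaries become NaN.
left = pd.merge(df1, df2, on="key", how="left")

print(inner["key"].tolist())        # [1, 2]
print(left["key"].tolist())         # [1, 2, 3]
print(left["salary"].isna().sum())  # 1
```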
A lambda function is an anonymous function defined with the lambda keyword.
Example: lambda x: x + 1
Use libraries like matplotlib and seaborn.
Example: import matplotlib.pyplot as plt; plt.plot(data)
The groupby function is used to group data by one or more columns and apply aggregate functions.
Example: df.groupby('column').sum()
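A small sketch of grouping with one and with several aggregates. The DataFrame and its column names (`department`, `amount`) are made up for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Sales", "IT", "Sales", "IT"],
    "amount": [100, 200, 300, 400],
})

# One aggregate per group: the result is a Series indexed by department.
totals = df.groupby("department")["amount"].sum()

# Several aggregates at once: the result is a DataFrame with one column each.
summary = df.groupby("department")["amount"].agg(["sum", "mean"])

print(totals.to_dict())              # {'IT': 600, 'Sales': 400}
print(summary.loc["Sales", "mean"])  # 200.0
```

Selecting the column before aggregating (`["amount"]`) avoids summing columns that were never meant to be aggregated.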
Use the LinearRegression class from scikit-learn.
Example: from sklearn.linear_model import LinearRegression; model = LinearRegression(); model.fit(X, y)
A list is an ordered collection of items accessed by position, while a dictionary is a collection of key-value pairs accessed by key. Since Python 3.7, dictionaries preserve insertion order.
Example: list_example = [1, 2, 3] vs dict_example = {'key': 'value'}
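A brief sketch of the practical difference: lists are indexed by position, dictionaries by key. The values are arbitrary.

```python
list_example = [10, 20, 30]
dict_example = {"b": 2, "a": 1}

# Lists are accessed by integer position.
first_item = list_example[0]

# Dictionaries are accessed by key, not by position.
value_for_a = dict_example["a"]

# Iterating a dict yields keys in insertion order (Python 3.7+).
key_order = list(dict_example)

print(first_item)   # 10
print(value_for_a)  # 1
print(key_order)    # ['b', 'a']
```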
A Series is a one-dimensional labeled array capable of holding any data type, while a DataFrame is a two-dimensional labeled data structure with columns of potentially different data types.
Example: import pandas as pd; series = pd.Series([1, 2, 3]); dataframe = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
Use the loc indexer or boolean indexing.
Example: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}); filtered_df = df[df['A'] > 1]
Use the apply method.
Example: df['C'] = df['A'].apply(lambda x: x * 2)
The pivot_table function is used to create a spreadsheet-style pivot table as a DataFrame.
Example: pivot_df = df.pivot_table(values='B', index='A', aggfunc='sum')
Use the pd.to_datetime function to convert a column to datetime.
Example: df['date'] = pd.to_datetime(df['date_column'])
Use the concat function with axis=0.
Example: df1 = pd.DataFrame({'A': [1, 2]}); df2 = pd.DataFrame({'A': [3, 4]}); concatenated_df = pd.concat([df1, df2], axis=0)
The pivot function is used to reshape data where you need a new column for each unique value in a specified column.
Example: pivot_df = df.pivot(index='date', columns='item', values='value')
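The sketch below reshapes an invented long-format table with pivot, then shows the main practical difference from pivot_table: pivot raises an error when an index/column pair occurs more than once, whereas pivot_table aggregates the duplicates.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "item": ["apple", "pear", "apple", "pear"],
    "value": [10, 20, 30, 40],
})

# One row per date, one column per distinct item.
wide = df.pivot(index="date", columns="item", values="value")
print(wide.loc["2024-01-01", "apple"])  # 10

# Add a duplicate (date, item) pair: pivot can no longer reshape the data.
dup = pd.concat([df, df.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index="date", columns="item", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves the duplicates with an aggregate function instead.
agg = dup.pivot_table(index="date", columns="item", values="value", aggfunc="sum")

print(pivot_failed)                    # True
print(agg.loc["2024-01-01", "apple"])  # 20
```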
Use the corr method.
Example: correlation_matrix = df.corr() (in pandas 2.0 and later, pass numeric_only=True if the DataFrame contains non-numeric columns)
Use the drop_duplicates method.
Example: df = df.drop_duplicates()
Use the fillna or dropna methods.
Example: df = df.fillna(0) to fill missing values with 0, or df = df.dropna() to drop rows with missing values
A lambda function is an anonymous function defined with the lambda keyword. It is used for short, throwaway functions.
Example: df['C'] = df['A'].apply(lambda x: x * 2)
Use the pd.read_excel function.
Example: df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
Use the to_csv method.
Example: df.to_csv('file.csv', index=False)
Use the plot method from pandas or libraries like matplotlib and seaborn.
Example: import matplotlib.pyplot as plt; df['A'].plot(kind='bar'); plt.show()
Use the rolling method.
Example: df['moving_avg'] = df['A'].rolling(window=3).mean()
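A quick sketch of what the rolling mean produces on a tiny invented series. Note that the first `window - 1` rows are NaN unless `min_periods` is lowered.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5]})

# Each value is the mean of the current row and the two rows before it.
df["moving_avg"] = df["A"].rolling(window=3).mean()

# min_periods=1 computes a mean from however many rows are available so far.
df["moving_avg_partial"] = df["A"].rolling(window=3, min_periods=1).mean()

print(df["moving_avg"].tolist()[2:])      # [2.0, 3.0, 4.0]
print(df["moving_avg"].isna().sum())      # 2
print(df["moving_avg_partial"].tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```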
The merge function is used to combine two DataFrames based on a key column.
Example: merged_df = pd.merge(df1, df2, on='key')
Use statistical methods to identify and handle outliers, such as Z-scores or the IQR method.
Example: import numpy as np; from scipy import stats; df = df[np.abs(stats.zscore(df['A'])) < 3]
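The IQR method mentioned above can be sketched with pandas alone. The data are invented, and the 1.5 multiplier is the conventional choice rather than a fixed rule.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 12, 11, 13, 12, 11, 100]})

# Interquartile range: the spread of the middle 50% of the values.
q1 = df["A"].quantile(0.25)
q3 = df["A"].quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
filtered = df[df["A"].between(lower, upper)]

print(lower, upper)           # 8.75 14.75
print(filtered["A"].tolist()) # [10, 12, 11, 13, 12, 11]
```

Unlike the Z-score approach, the IQR bounds are not pulled toward the outlier itself, which makes this method more robust on small samples.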
The describe method provides summary statistics of the DataFrame.
Example: summary_stats = df.describe()
The melt function unpivots a DataFrame from wide to long format.
Example: melted_df = pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
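To show what the long format looks like, here is the example call applied to a small invented DataFrame. The output columns `variable` and `value` are the pandas defaults.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y"], "B": [1, 2], "C": [3, 4]})

# Columns B and C are stacked into (variable, value) pairs; A is repeated.
melted_df = pd.melt(df, id_vars=["A"], value_vars=["B", "C"])

print(list(melted_df.columns))        # ['A', 'variable', 'value']
print(melted_df["variable"].tolist()) # ['B', 'B', 'C', 'C']
print(melted_df["value"].tolist())    # [1, 2, 3, 4]
```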
The iloc indexer selects rows and columns by integer position.
Example: subset = df.iloc[0:5, 0:2]
Use the concat function with axis=1.
Example: df1 = pd.DataFrame({'A': [1, 2]}); df2 = pd.DataFrame({'B': [3, 4]}); concatenated_df = pd.concat([df1, df2], axis=1)
The astype method is used to cast a pandas object to a specified data type.
Example: df['A'] = df['A'].astype(float)
Use the get_dummies function.
Example: dummies = pd.get_dummies(df['category_column'])
Use the drop method.
Example: df = df.drop(columns=['column_to_drop'])