Data Cleaning in the SAS Programming Language

Sujan Katwal
Apr 30, 2023

Today I am going to clean an Audible dataset using some of the functions in SAS. The dataset has character and numeric columns containing missing values and special characters.

This Audible dataset can be downloaded from kaggle.com - https://www.kaggle.com/datasets/rohitdass/audible-dataset

1. Importing dataset in SAS Studio:
After downloading the CSV file from kaggle.com, we can import the file into our SAS Studio.

/* importing the csv dataset into SAS Studio */
filename audible '/home/u63049952/Projects/Audible_Dataset_final.csv';

proc import datafile=audible out=audible_data dbms=csv replace;
getnames=yes;
run;

2. Sneak Peek at our dataset:
Displaying a random sample of 10 observations from the dataset.

 /* Define the sample size */
%let n=10;

/* randomly selecting 10 different observations from the dataset */
proc surveyselect data=audible_data out=sample_audio method=srs sampsize=&n;
run;

/* Display the sample data */
title "sample audible data";

proc print data=sample_audio;
run;
/*clearing the title*/
title;

3. Data Cleaning:
3.1 Renaming the names of the columns:
To make analysis easier, we will rename the columns whose names contain spaces, replacing the spaces with underscores. This eliminates the complications that spaces in column names can cause, such as having to write name literals like 'Book Title'n everywhere.

data dataset_rename;
/* instead of dropping review 1 to review 100 we are using keep statement */
set audible_data
(keep='Book Title'n 'Book Author'n 'Book Narrator'n 'Audio Runtime'n
Audiobook_Type Categories Rating 'Total No. of Ratings'n price);

/* renaming columns name with spaces between words */
rename 'Book Title'n=title 'Book Author'n=authors
'Book Narrator'n=Book_Narrator 'Audio Runtime'n=Audio_Runtime
'Total No. of Ratings'n=Total_No_of_Ratings;
run;

3.2 Removing the Special Characters
Here we remove the special characters present in the book title and author columns. If a book lists multiple authors, we keep only the first author name. Similarly, for the book title, we drop everything after a '(' or '/' symbol.


data dataset_specialchar;
set dataset_rename;
/* keeping only the part of the book's title before '(' or '/' */
title=scan(title, 1, '(');
title=scan(title, 1, '/');
/* stripping the remaining special characters in a single pass */
title=compress(title, '.@:/&*,-!');

/* keeping only the main author name by removing
the secondary author names after '/' */
authors=scan(authors, 1, '/');
authors=compress(authors, '.@:/&*,-!');
run;
Book Title Before and After

3.3 Converting Character Columns to Numerical
The columns Rating and Audio_Runtime are stored as character types, so we convert them to numeric types for our analysis.


data char_to_num;
set dataset_specialchar;
/* converting the rating column from char to num*/
if Rating='Not rated yet' then
Rating=' ';

ratingNumericConverted=input(Rating, 8.2);
format ratingNumericConverted 8.2;
drop Rating;

/*Converting Audio Runtime to minutes*/
/* Audio_Runtime has the form "<n> hrs and <m> mins": word 1 = hours, word 4 = minutes */
hrs=input(scan(Audio_Runtime, 1, ' '), best12.);
mins=input(scan(Audio_Runtime, 4, ' '), best12.);
audioRuntimeInMinutes=hrs * 60 + mins;

/*renaming*/
rename audioRuntimeInMinutes=Runtime_Minutes ratingNumericConverted=Rating;
/*dropping the unnecessary columns*/
drop hrs mins Audio_Runtime;
run;
Before and after converting char to num of audio_runtime column

4. Removing Duplicate Records
There are different ways to remove duplicate records from a dataset. In this article, we use two methods: noduprecs in proc sort, and first. processing in a data step. Noduprecs removes an observation only when it is an exact copy of the entire previous observation, whereas with first. we remove observations that share the same book title and author name.

/* removing the duplicates data */
/*Method 1*/
proc sort data=char_to_num noduprecs;
/* noduprecs compares entire observations and deletes exact duplicates of the previous one */
by title authors;
run;

/*Method 2*/
data dataset_noduplicates;
set char_to_num;
by title authors;
/* first.authors is 1 for the first observation of each title/author group;
testing first.title as well would wrongly delete books that share a title
but have different authors */
if first.authors then
count=1;
else
count+1;
/*deleting duplicates records*/
if count > 1 then
delete;
drop count;
run;
Before and after removing duplicates
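
As a side note, SAS also offers a one-step alternative for the title/author case: the nodupkey option of proc sort keeps only the first observation for each combination of the by variables. A minimal sketch, not part of the pipeline above (the output name dataset_nodupkey is mine):

```sas
/* nodupkey keeps one observation per title/author combination */
proc sort data=char_to_num out=dataset_nodupkey nodupkey;
by title authors;
run;
```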

5. Missing and Non-Missing Values in the dataset.
Here, we are going to find out how many values are missing in each column.

proc format;
value $missfrm ' '='Missing' other='Not Missing';
run;

title"Missing Values in Character Columns";

proc freq data=dataset_noduplicates;
format _char_ $missfrm.;
tables _char_ / missing nocum nopercent;
run;

title;
title"Missing Values in Numerical Columns";

proc means data=dataset_noduplicates n nmiss;
run;

title;
Missing Values in character columns and numerical columns.

5.1 Null Values in Character Columns:
We remove observations that have null values in any of the character columns.


data dataset_filter;
set dataset_noduplicates;
if not missing(book_narrator) and not missing(audiobook_type) and
not missing(categories);
run;

5.2 Outliers and Null Values in Numerical Columns:
For null values and outliers in the numerical columns, instead of deleting observations as we did for the character columns, we replace them with the median (Q2). For this we use proc univariate and macro variables. In addition, we also transform the Total_No_of_Ratings column so that its skewness falls in the range of -0.5 to 0.5.

/*finding the median and skewness of :Total_No_of_Ratings*/
title"Histogram of Total_No_of_Ratings before removing outliers:";
ods noproctitle;
proc univariate data=dataset_filter noprint;
var Total_No_of_Ratings;
histogram;
output out=ratings_stat_1 pctlpre=perc pctlpts=25 50 75
mean=mean_v skewness=skewness_v;
run;
Histogram before removing outliers and null values.
Statistics before removing null values and outliers.

Replacing the null values and outliers with the median. Outliers are values smaller than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR, where the interquartile range IQR = Q3 - Q1.


/* temporary dataset */
data _null_;
/* Use OBS=1 option to read only the first observation */
set ratings_stat_1(obs=1);
/* creating macros*/
call symputx('ar_q1', perc25);
call symputx('ar_q2', perc50);
call symputx('ar_q3', perc75);
run;

*replacing outliers and null values with median;
data temp_1;
set dataset_filter;
Q1=&ar_q1.;
Q2=&ar_q2.;
Q3=&ar_q3.;

/* replacing outliers with the median (Q2); note the condition must be OR,
since no value can be above the upper fence and below the lower fence at once */
if Total_No_of_Ratings > (Q3 + 1.5*(Q3 - Q1))
or Total_No_of_Ratings < (Q1 - 1.5*(Q3 - Q1)) then
Total_No_of_Ratings=Q2;
/* replacing the missing values with the median (Q2) */
if missing(Total_No_of_Ratings) then
Total_No_of_Ratings=Q2;
drop Q1 Q2 Q3;
run;

proc univariate data=temp_1 noprint;
var Total_No_of_Ratings;
output out=ratings_stat_2 pctlpre=perc
pctlpts=25 50 75 mean=mean_v skewness=skewness_v;
run;

title "Statistics of Total_No_of_Ratings after removing outliers";
proc print data=ratings_stat_2;
run;

Even after replacing the null values and outliers with the median, the skewness of our column is 5.56. Hence we apply a log transformation to bring the skewness between -0.5 and 0.5.


/* Normalizing data using log transformation */
data transformation_rating;
set temp_1;
/* applying the log transformation; the +10 offset keeps the argument positive
when the ratings count is 0 */
log_Total_No_of_Ratings=log(Total_No_of_Ratings+10);
run;

title"Statistics of Total_No_of_Ratings after log transformation:";
proc univariate data=transformation_rating noprint;
var log_Total_No_of_Ratings;
histogram;
output out=ratings_stat_final pctlpre=perc pctlpts=25 50 75
skewness=skewness_v;
run;

title"Statistics of Total_No_of_Ratings after log transformation";
proc print data=ratings_stat_final;
run;
Distribution of data after log transformation

I'm relieved to say that the skewness of our data now falls within the range of -0.5 to 0.5, indicating a generally symmetrical distribution.

While we previously used proc univariate and macro variables to achieve this, proc sql offers a simpler alternative. For the Runtime_Minutes column, we will use proc sql.

/*finding the outliers*/
proc sql;
create table outliers as select runtime_minutes from transformation_rating
having (runtime_minutes not between mean(runtime_minutes) -
2 * std(runtime_minutes) and mean(runtime_minutes) +
2 * std(runtime_minutes)) or runtime_minutes is missing ;
quit;

title" Data having outliers";
proc print data=outliers;
run;
/*Removing the outliers*/
proc sql;
create table outliers_free_data as
select runtime_minutes,
case when runtime_minutes not between mean(runtime_minutes) - 2 * std(runtime_minutes)
and mean(runtime_minutes) + 2 * std(runtime_minutes)
or missing(runtime_minutes) then mean(runtime_minutes)
else runtime_minutes
end as audio_runtime_mins
from transformation_rating;
quit;
Data before and after replacing the null values and outliers

Here, we successfully replaced the missing values and outliers with the mean value. Even though the outliers and null values are handled, we still have to check the skewness of the data. Our goal is to bring the skewness of the runtime_minutes column into the range of -0.5 to 0.5.

title"Histogram of runtime_minutes before";
ods select histogram;
proc univariate data=outliers_free_data noprint;
var runtime_minutes;
histogram / normal ;
output out=stats_runtime_minutes pctlpre=prec pctlpts=25 50 75 mean=mean_v
skewness=skewness_v ;
run;

title"Skewness before removing outliers and null values";
proc print data=stats_runtime_minutes;
run;

title"Histogram of runtime_minutes after";
ods select histogram;
proc univariate data=outliers_free_data noprint;
var audio_runtime_mins;
histogram / normal ;
output out=stats_audio_runtime_mins pctlpre=prec pctlpts=25 50 75
mean=mean_v skewness=skewness_v ;
run;

title "Skewness after removing outliers and null values";
proc print data=stats_audio_runtime_mins;
run;
Histogram (<- ) before and after (->) removing the null values and outliers.
Skewness before removing the null values and outliers
Skewness after removing the null values and outliers

Although the skewness of runtime_minutes is not in the range of -0.5 to 0.5, it is much lower than the initial skewness of 3.10. We could use a log transform or a square-root transform to lower it further. Since 0.72 is only slightly greater than 0.5, for now I'm leaving the skewness as it is.
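
For reference, here is a sketch of what that square-root transform could look like, following the same pattern as the earlier log transform and assuming we keep working from outliers_free_data; the dataset and variable names (transformation_runtime, sqrt_runtime_minutes, runtime_stat_sqrt) are mine:

```sas
/* sketch: square-root transform to reduce the remaining right skew */
data transformation_runtime;
set outliers_free_data;
sqrt_runtime_minutes=sqrt(audio_runtime_mins);
run;

/* checking the skewness after the transform */
proc univariate data=transformation_runtime noprint;
var sqrt_runtime_minutes;
output out=runtime_stat_sqrt skewness=skewness_v;
run;
```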

If you want to replace the missing values in a column with the mean while ignoring the skewness of the data, proc stdize offers a one-step alternative. For instance, applying it to the rating column, which has 296 missing values, replaces each missing value with the column mean. The mean itself is unchanged by this, but you can see the standard deviation of rating shrink between the before and after runs below.



/* means with empty value */
title"Before:";
proc means data=outliers_free_data ;
var rating;
run;

/* removing empty values with means */
proc stdize data=outliers_free_data reponly method=mean
/* dropping the unnecessary columns */
out=Clean_data(drop=runtime_minutes Total_No_of_Ratings);
var rating ;
run;
/* means after removing empty values with mean value */
title "After:";
proc means data=Clean_data ;
var rating;
run;
Before and after replacing the null value with the mean value

Finally, I'm double-checking that our dataset is complete. Despite handling null values and outliers with the operations above, I'll verify once more that no missing values remain.

/* Finally checking if there are any missing values or not */
proc format;
value $missfrm ' '='Missing' other='Not Missing';
run;

title"Missing Values in Character Columns";
proc freq data=Clean_data ;
format _char_ $missfrm.;
tables _char_ / missing nocum nopercent;
run;

title;
title"Missing Values in Numerical Columns";
proc means data=Clean_data n nmiss;
run;
Clean Data

Conclusion:
We have been able to handle missing values in our dataset effectively using different techniques in SAS. Additionally, we were able to achieve symmetry in our data by replacing the outliers with the mean or median and then applying log transformation. Our approach allowed us to maintain the skewness of the data within the acceptable range of -0.5 to 0.5.
