SAS – the acronym stands for ‘Statistical Analysis System’ – has a long and illustrious history. Development of the first version began in 1966, in the days of punched card storage and mainframe computers.
As computing hardware became more powerful, so datasets grew. SAS in turn evolved to harness greater computational power and meet the data challenge, but the original concepts of data steps and proc steps endured. Then, in the 1990s, as SQL (Structured Query Language) databases rose to prominence, SAS found itself with some serious competition.
In this article we look at the SAS response to the SQL challenge, and examine how SAS lends itself to use today in data warehouse settings.
SQL was aimed at data maintenance and storage using relational tables rather than flat files. SQL approaches became practical at scale in the 1990s thanks to technology developments that allowed faster data processing.
SQL languages accomplish the same data editing tasks that data steps do in SAS, but they use a different approach. Unlike SAS, which is a procedural language, SQL is a declarative language: the programmer states what result is wanted, and the database engine works out how to produce it.
While using efficient code in SQL can still improve performance, the efficiency of SQL is less dependent on the programmer's code and more dependent on the database’s query optimizer. Hence, maintaining data in a database became easier using SQL than it was using SAS’s data steps. In SQL, programmers had to learn a few basic commands that could perform a variety of tasks when used together. But with SAS data steps, programmers needed to study a broad set of commands, and they had to learn the most efficient way to assemble those commands to achieve optimal SAS performance.
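To make the contrast concrete, here is a minimal sketch of the same subsetting task written both ways. The dataset and variable names are invented for illustration, not taken from this chapter's examples:
/* Data step: the programmer spells out how each row is read,
   tested, and written out */
DATA work.subset;
SET work.patients;
IF age >= 65 THEN OUTPUT;
RUN;
/* SQL: the programmer states what result is wanted and the
   engine decides how to produce it */
PROC SQL;
CREATE TABLE work.subset AS
SELECT * FROM work.patients
WHERE age >= 65;
QUIT;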
What SQL cannot do, however, is analyze data the way SAS can. In the late 1990s and early 2000s, with many databases being stored and maintained in SQL, the SAS/Access feature made it possible to use SAS on them for analysis.
Using PROC SQL
As SQL became more established, more programmers were trained in it. These SQL programmers had trouble transferring their skills to SAS data steps, so SAS responded by developing a SQL language within SAS: PROC SQL.
PROC SQL brings SQL's declarative approach to data management into SAS. Like SQL, PROC SQL is much easier to use than data step language for certain common tasks. One particularly useful example is creating a VIEW of the data, which allows the user to look at a particular section of the dataset.
Imagine we wanted to view the data in our example dataset, Chap_1_1_Infile, but we only wanted to look at the data for women (SEX1 = 2) who live in Massachusetts (_STATE = 25). We could use this PROC SQL code:
PROC SQL;
SELECT * FROM Chap_1_1_Infile
WHERE SEX1 = 2 AND _STATE = 25;
QUIT;
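If we wanted to save this selection as a named view rather than simply display it, we could wrap the same query in a CREATE VIEW statement. A minimal sketch follows; the view name Mass_Women is our own choice, not from the chapter:
PROC SQL;
/* A view stores the query itself; the subset is produced
   fresh each time the view is read */
CREATE VIEW Mass_Women AS
SELECT * FROM Chap_1_1_Infile
WHERE SEX1 = 2 AND _STATE = 25;
QUIT;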
This code produces output in the following structure (with just the first three rows provided):
Table 1 – Output from PROC SQL
To get similar output using SAS commands, the following PROC PRINT code could be used. Note that all variables are displayed, in the order stored in the dataset, since the VAR statement is omitted:
PROC PRINT DATA=Chap_1_1_Infile;
WHERE SEX1 = 2 AND _STATE = 25;
RUN;
But imagine we did not want to return all the variables – assume we only wanted to return age (_AGE80) and BMI (_BMI5). We could easily replace the asterisk in our PROC SQL code to specify only those two columns:
PROC SQL;
SELECT _AGE80, _BMI5 FROM Chap_1_1_Infile
WHERE SEX1 = 2 AND _STATE = 25;
QUIT;
In PROC PRINT, to achieve the same output, we would add a VAR statement to our previous code:
PROC PRINT DATA=Chap_1_1_Infile;
WHERE SEX1 = 2 AND _STATE = 25;
VAR _AGE80 _BMI5;
RUN;
Even in this short example, it is easy to see how SAS PROCs and data steps are more complicated than SQL commands because SQL has fewer, more modular commands. By contrast, SAS has an extensive toolset of commands and options that, when understood and used wisely, can achieve just about any result with big data.
SAS in today’s warehouse environment
While PROC SQL may appear to be a workaround for learning a complicated data step language, this is not the case in data warehousing. Because PROC SQL lacks the query optimization of a dedicated SQL database engine, in many environments it is very slow and can only feasibly be used with smaller datasets. Even today, when transforming big data in SAS, in most environments it is necessary to use data step language, and this affords the programmer an opportunity to develop optimized code, as efficiency is always important when dealing with data in SAS.
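As one small illustration of this kind of optimization (the library, dataset, and variable names here are invented for the example), filtering with a WHERE= dataset option discards unneeded rows before they are read in, which reduces I/O compared with a subsetting IF statement:
DATA work.big_subset;
/* Rows failing the WHERE= test are never read into the program
   data vector, and KEEP= limits the columns transferred, so I/O
   falls on both axes */
SET warehouse.big_table(WHERE=(year = 2019)
                        KEEP=year region value);
RUN;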
However, when interfacing with another database management system (DBMS) where native data are stored in SQL, SAS PROC SQL might be more useful. In his recent white paper on working with big data in SAS, Mark Jordan describes various modern approaches to improving the processing efficiency of both PROC SQL and SAS data steps in both server SAS environments and environments where SAS is used as the analysis engine, connecting to a non-SAS DBMS through SAS/Access. (See Further reading.)
Jordan describes two scenarios for big data storage and SAS: data stored natively in a server SAS environment, and data stored in a non-SAS DBMS that SAS reaches through SAS/Access.
Ultimately, the main bottleneck in SAS processing is input/output (I/O), so the easier it is for the SAS analytic engine to interact with stored data, the faster processing will be. But even in this modern era, limitations surrounding data I/O continue to compel SAS users to develop efficient code.
Jordan also provides a number of tips for thinking about coding for a SAS data warehouse.
Note: In his white paper, Jordan compares PROC SQL queries that filter a high-cardinality variable using the SCAN function against queries that use the LIKE operator; LIKE is found to be more efficient.
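A sketch of the two query styles being compared follows; the dataset and variable names are placeholders, not those from Jordan's paper:
/* Filtering with the SCAN function: SAS must evaluate the
   function for every row before it can test the result */
PROC SQL;
SELECT * FROM warehouse.tickets
WHERE SCAN(description, 1) = 'PASSWORD';
QUIT;
/* Filtering with the LIKE operator: a pattern match that the
   query engine can evaluate more efficiently */
PROC SQL;
SELECT * FROM warehouse.tickets
WHERE description LIKE 'PASSWORD%';
QUIT;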
SAS in the Cloud
Jordan also describes how SAS now has a new Viya architecture that offers Cloud Analytic Services (CAS). A CAS library provides in-memory, distributed processing capabilities that go beyond those of a traditional SAS library.
A CAS library has application programming interfaces (APIs) that allow actions to be executed from a variety of languages, including Java, Python, and R, and of course, the SAS Version 9.4 client application.
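As a brief sketch of what working with a CAS library looks like from a SAS 9.4 client (the session and table names here are our own, and the environment is assumed to already be configured to reach a Viya server):
cas mysess;                        /* start a CAS session */
libname mycas cas sessref=mysess;  /* point a libref at CAS */
/* Load a dataset into CAS memory for distributed processing */
proc casutil sessref=mysess;
load data=sashelp.cars casout="cars" replace;
run;
/* The in-memory table can then be used like any library member */
proc means data=mycas.cars;
var msrp;
run;
cas mysess terminate;              /* end the session */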
Today, not all warehouse data is stored in the Cloud; many datasets are still stored on traditional servers. Jordan recommends that a user who has an installation of the SAS 9.4M5 application and access to SAS Viya CAS, and who wants to decide whether to move to CAS from a traditional server, should compare the processing time on a subset of data in both environments. Jordan was able to cut the processing time from over 1 minute to 2.35 seconds by moving his data from a traditional server to SAS Viya CAS.
Applications of SAS in modern warehousing
Today, SAS data warehousing is more complicated than it was in the past because there are so many options. Learning about these options can help the user envision the possibilities and design a SAS data warehousing system appropriate to their organization's needs. In this section we'll look at warehousing unstructured text, using SAS components for warehousing, using SAS alongside R, and connecting to Snowflake.
Warehousing unstructured text
In his white paper on warehousing unstructured text in SAS, Nick Evangelopoulos describes how the IT Shared Services (ITSS) division at the University of North Texas (UNT) used SAS to study their service tickets and try to improve services (see Further reading). Broadly, the team extracted the ticket text and used SAS Enterprise Miner (EM) to classify each ticket into a topic.
After doing this, the team wondered if SAS EM would classify the tickets under the same topic as the user entering the ticket would. To answer this question, the team analyzed 1,481 new tickets that were classified using SAS EM as well as being classified by the user. They found dramatic differences between how users and SAS EM classified the tickets, suggesting that this classification may need additional development in order to be useful.
Using SAS components for warehousing
A white paper by Li-Hui Chen and Manuel Figallo describes a modern SAS data warehouse built from SAS applications (see Further reading). The US Department of Health and Human Services (DHHS) maintains a data warehouse of health indicators called the Health Indicators Warehouse (HIW). The authors describe how they upgraded their SAS data warehouse system to improve performance and customer service using SAS Visual Analytics (VA) accessed through an API.
The HIW serves many users over the internet. Prior to the upgrade, SAS datasets were accessed from storage using SAS, and extract-transform-load (ETL) processes needed to take place manually on the data before it could be visualized. This made the data in the warehouse difficult to visualize.
With the upgrade, this is the new process:
1. Users obtain permission to access the API, which controls access to the underlying data as well as the VA capabilities.
2. Using the API, which contains a GUI, users indicate which health indicator they want to extract from the HIW, and how they want to visualize it.
3. The API extracts the necessary data from the HIW data store using automated processing.
4. The API performs necessary ETL processes to support visualization.
5. The API then visualizes the results using VA.
Here's a schematic diagram comparing the old and new systems:
Figure 1 – SAS warehousing system before and after adding an API layer
Focusing on I/O, the authors point out that ETL in the API is achieved by running SAS macros: code routines developed in the SAS macro language that can take user or system inputs and be run automatically. These macros can be run either through a stored process (where the macro runs on one dataset at a time) or a batched process (where the macro runs on several datasets at once). The authors found that they needed to use a batch process when transferring large amounts of HIW data through an API call.
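A sketch of what such an ETL macro might look like follows; the macro, library, dataset, and variable names below are invented for illustration, not taken from the authors' code:
%MACRO extract_indicator(indicator=, outlib=work);
/* Extract: pull one indicator's rows from the HIW store */
DATA &outlib..&indicator.;
SET hiw.indicators;
WHERE indicator_name = "&indicator.";
/* Transform: keep only the fields needed for visualization */
KEEP indicator_name report_year geography value;
RUN;
%MEND extract_indicator;
/* Stored-process style: one dataset per call */
%extract_indicator(indicator=obesity_rate);
/* Batch style: several calls submitted together */
%extract_indicator(indicator=obesity_rate);
%extract_indicator(indicator=smoking_rate);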
Using SAS alongside R
SAS has been around for a long time and has typically been the first choice for warehousing big data. However, since the rise of SQL, there has been competition between SAS and SQL for data storage functions. With the rise of R, an open-source statistical language known for visualization and easy web integration, SAS has additionally seen competition with respect to statistical analysis functions.
Over time, SAS has responded to competition by building in extra functionality; SAS/Access, SAS VA, and SAS Viya are all examples of this. However, the reality is that SAS's core strength is analytics, so other applications tend to be superior at these other functions. This has created challenges for legacy SAS warehouses, which are now rethinking how they use SAS in their systems. Teams are approaching this challenge with a variety of responses.
Dr. Elizabeth Atkinson shared her team's story of moving from a 100% SAS shop to incorporating R for some functions. She leads a biostatistics service at the Mayo Clinic, a famous specialty clinic in the US. The service has been a SAS shop since 1974, when punch cards were still being used, and now has a staff of 300 at three locations, supporting data storage and analysis for studies both large and small.
In 2014, Mayo went to negotiate their SAS license and found that the price had increased significantly. SAS has always been a distinctive product with a high price; according to a Computerworld article, in 1995 a full SAS application development package, when bundled for 10 users, cost $1,575 per seat, which is expensive even by today's standards. In 2014, however, the increase in cost was felt to be unsustainable, and the Mayo team started looking for other options.
They wanted to decrease their dependence on SAS by moving some of their functions to R, while also improving customer service and satisfaction. Doing so presented several challenges.
They took a number of steps to integrate R into their shop.
Even after offering R as an alternative, many users chose to stay with SAS, and the shop could not completely convert from SAS to R.
One of the overall benefits of this effort was that it opened a larger conversation about what skills analysts in the division will need in the future. These considerations run parallel to questions about what SAS and non-SAS components will be used in the data system in the near future, what roles they will play, how they will be supported, and how they will work together to improve the user experience.
Connecting to Snowflake
As data gets bigger and bigger, new solutions have been developed to store data in the Cloud. Microsoft Azure and Amazon Web Services (AWS) are cloud services that help move business operations to the Cloud. Snowflake (https://www.snowflake.com/) is a relatively new cloud data platform that runs on Microsoft Azure and AWS, and may run on other cloud services in the future.
Snowflake enables a programmer to build a virtual data warehouse at little cost, thus solving a data storage problem. However, data still needs to be accessed to be analyzed, so SAS upgraded its SAS/Access component to connect directly to Snowflake.
SAS documentation about connecting to Snowflake indicates that Snowflake uses SQL as its query language. Both PROC SQL and regular SAS functions can be passed to Snowflake, but there are cases where SAS and Snowflake function names conflict. Furthermore, careful setting of options and code tuning are needed to improve I/O from SAS to Snowflake.
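A minimal sketch of such a connection follows; the server, credentials, and table names below are placeholders, not real values:
/* Assign a libref to Snowflake through SAS/Access */
LIBNAME snow SNOWFLAKE
SERVER='myaccount.snowflakecomputing.com'
DATABASE=analytics SCHEMA=public
USER='myuser' PASSWORD='XXXXXXXX';
/* Implicit pass-through: SAS translates what it can into
   Snowflake SQL so the filtering work runs in the database */
PROC SQL;
SELECT COUNT(*) FROM snow.sales
WHERE region = 'NE';
QUIT;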
Although products like Snowflake can solve the big data storage problem, the issue with SAS will always be I/O. Using the newest and most appropriate technology along with the most efficient coding approaches will always be the best strategy for dealing with the data warehousing of big data in SAS.
Conclusions
As SAS environments evolved from running on mainframes to being accessible by PCs, SAS developed new products and services to complement its core analytics and data management functions.
SAS data steps are procedural and provide the programmer with opportunities to greatly improve I/O through the use of certain commands, features, and approaches to programming. When SQL became popular, SAS introduced PROC SQL, which allowed SAS users to choose between data steps and SQL commands when managing data in SAS.
Today, SAS is still used in data warehousing, but there are new challenges with accessing data in the Cloud. SAS data warehouses today can include predominantly SAS components, such as SAS VA and CAS. Or, SAS can be part of a warehouse system that includes other components and applications, such as cloud storage in Snowflake, and supplemental analytic functions provided by R.
Modern SAS data warehousing still seeks to improve I/O and to better serve warehouse users through the development of an efficient system that meets customer needs. Creativity is required in the design of modern SAS data warehouses so that the system can leverage the best SAS has to offer while avoiding its pitfalls.
Further reading
Working with Big Data in SAS®, SAS white paper by Mark Jordan – available at https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/2160-2018.pdf
From Unstructured Text to the Data Warehouse: Customer Support at the University of North Texas, SAS white paper by Nick Evangelopoulos – available at https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2018/1900-2018.pdf
Bridging the Gap: Importing Health Indicators Warehouse Data into SAS® Visual Analytics Using SAS® Stored Processes and APIs, SAS white paper by Li-Hui Chen and Manuel Figallo – available at https://support.sas.com/resources/papers/proceedings16/10540-2016.pdf
A SAS to R Success Story, video of Dr. Elizabeth Atkinson – available at https://rstudio.com/resources/rstudioconf-2018/a-sas-to-r-success-story/