Ozcan Ozturk

Research Highlights


Research Assistant Positions Available: We are looking for bright, creative, self-motivated, and hardworking students to work on our funded projects. If you are interested in one of the following research topics, please contact Dr. Ozcan Ozturk to discuss research opportunities.


Accelerators: Accelerator Technologies, Heterogeneous Systems, Manycore Accelerators

Parallel Systems: Heterogeneous Clusters, GPU-based Systems, Efficient Parallelization, Resource Management, Cloud Computing, Parallel Programming

Processor Architecture: Multicore Processors, Computer Organization, Reliability-Aware 3D Chip Multiprocessor Design, Heterogeneous Chip Multiprocessors, Network On Chip Architectures

Compilers: Automatic Parallelization, Dataflow Analysis, Optimizing Compilers, Memory Optimization

Exploiting Similarity in Deep Neural Networks for Efficient Video Processing

Deep learning (DL) models must run efficiently under real-time, stringent execution requirements. Real-time video processing is widely used in safety-critical autonomous systems, where preserving both accuracy and computational efficiency is of utmost importance. Processing a single frame typically requires a large number of computations and significant energy. However, consecutive frames exhibit similar content and produce similar results. The proposed methodology incorporates a regularization technique at layerwise granularity, improving computational efficiency by promoting weight similarity throughout the training process.
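
As a minimal software sketch of the reuse opportunity this line of work builds on (the tile interface, threshold, and layer computation below are illustrative placeholders, not our implementation), a layer can cache its output per input tile and recompute only the tiles that changed noticeably since the previous frame:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Illustrative only: reuse the cached layer output when the input tile
    // barely changed relative to the previous frame.
    struct TileReuseLayer {
        std::vector<float> prevTile;   // input tile from the previous frame
        std::vector<float> cachedOut;  // output computed for that tile
        float threshold = 1e-3f;       // similarity threshold (assumed value)

        std::vector<float> forward(const std::vector<float>& tile) {
            if (!prevTile.empty() && distance(tile, prevTile) < threshold)
                return cachedOut;       // similar input: reuse cached result
            cachedOut = compute(tile);  // dissimilar input: recompute
            prevTile = tile;
            return cachedOut;
        }

        static float distance(const std::vector<float>& a,
                              const std::vector<float>& b) {
            float d = 0.0f;
            for (std::size_t i = 0; i < a.size(); ++i)
                d += (a[i] - b[i]) * (a[i] - b[i]);
            return std::sqrt(d);
        }

        // Stand-in for the real layer computation (e.g., a convolution).
        std::vector<float> compute(const std::vector<float>& tile) {
            std::vector<float> out(tile.size());
            for (std::size_t i = 0; i < tile.size(); ++i)
                out[i] = 2.0f * tile[i];  // dummy operation
            return out;
        }
    };

The weight-similarity regularizer described above increases how often such reuse tests succeed, since similar weights map similar inputs to similar outputs.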

RISC-V based Neural Architecture Search

Neural architecture search (NAS) can be executed efficiently on a custom processor with specialized instructions. To automate the design of artificial neural networks (ANNs), NAS can be implemented with custom instructions that outperform hand-designed architectures. As the base architecture of our NAS processor, we chose RISC-V, an open instruction set architecture (ISA) introduced in 2010. RISC-V is available under free licenses, allowing anyone to design or manufacture RISC-V chips. Moreover, RISC-V has proven suitable for domain-specific designs: custom RISC-V chips with application-specific extensions have been built for digital signal processing, security, and isolated execution, among other areas.
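
To illustrate how such an extension is typically exposed to software (the opcode, funct fields, and operation below are hypothetical placeholders, not our actual ISA extension), a custom R-type instruction in RISC-V's reserved custom-0 opcode space can be emitted from C++ with the GNU assembler's .insn directive when building with a RISC-V toolchain:

    #include <cstdint>

    // Hypothetical NAS helper instruction encoded in the RISC-V custom-0
    // opcode space (opcode 0x0B). The funct3/funct7 values are placeholders.
    static inline uint64_t nas_custom_op(uint64_t a, uint64_t b) {
        uint64_t result;
        asm volatile(".insn r 0x0B, 0x0, 0x00, %0, %1, %2"
                     : "=r"(result)
                     : "r"(a), "r"(b));
        return result;
    }

Because the custom-0 through custom-3 opcode spaces are reserved for vendor extensions, such instructions coexist cleanly with standard RISC-V software.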

HLS-based High-throughput and Work-efficient Synthesizable Graph Processing Template Pipeline

Hardware systems composed of diverse execution resources are being deployed to cope with the complexity and performance requirements of Artificial Intelligence (AI) and Machine Learning (ML) applications. With the emergence of new hardware platforms, system-wide programming support has become much more important. While this is true for various devices ranging from CPUs to GPUs, it is especially critical for neural network accelerators implemented on FPGAs. For example, Intel’s recent HARP platform couples a Xeon CPU with an FPGA and requires an extensive software stack to be used effectively. Programming such a hybrid system is a challenge for most non-expert users. High-level language solutions such as Intel OpenCL for FPGA try to address the problem; however, as the abstraction level increases, the efficiency of the implementation decreases, exposing two opposing requirements. In this work, we propose a framework that generates an HLS-based, FPGA-accelerated, high-throughput, work-efficient, synthesizable, template-based graph-processing pipeline. While a fixed, precisely clocked deep-pipeline architecture written in SystemC processes the graph vertices, the user implements the intended iterative graph algorithm by implementing or modifying only a single module in C/C++. This way, efficiency and high performance are achieved with better programmability and productivity. With similar programming effort, the proposed template outperforms a high-throughput OpenCL baseline by up to 50% in terms of edge throughput. Furthermore, the novel work-efficient design significantly improves execution time and power consumption by up to 100×.
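
To give a feel for the programming model (the module interface below is a simplified illustration, not the template's exact API), the single user-written module for a PageRank-style algorithm only has to state how edge contributions are produced and combined; the fixed SystemC pipeline handles everything else:

    // Simplified illustration of the user-provided C/C++ module; gather,
    // scatter, and scheduling stay inside the fixed deep pipeline.
    struct PageRankKernel {
        static constexpr float kDamping = 0.85f;

        // Per-edge contribution forwarded along each out-edge.
        static float scatter(float vertexValue, int outDegree) {
            return outDegree > 0 ? vertexValue / outDegree : 0.0f;
        }

        // Called once per vertex per iteration with the reduced sum of
        // its in-edge contributions; returns the new vertex value.
        static float apply(float edgeSum, int numVertices) {
            return (1.0f - kDamping) / numVertices + kDamping * edgeSum;
        }
    };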

Energy Efficient Boosting of GEMM Accelerators for DNN

Neural networks, a subfield of artificial intelligence, compute combinations of linear and nonlinear functions. These functions are implemented through multiple layers, each performing a specific operation. Convolution is a common operation for extracting information from images, so convolutional neural networks (CNNs) have been widely used in many data mining and machine learning domains in recent years. They underpin many important tasks, from object recognition to autonomous driving and gesture recognition. For a specific task, the parameters of the functions in each layer of a CNN are optimized through a process called training. After training, running the network to accomplish the task is called inference. A CNN is generally trained once on powerful compute nodes such as GPUs or TPUs, but inference is performed many times on edge devices. It is therefore critical to perform inference with high performance and energy efficiency, since the target device usually has limited resources.
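
The reason GEMM accelerators matter here is that convolution lowers to matrix multiplication via the classic im2col transformation, so one well-tuned GEMM unit serves every convolutional layer. A minimal single-channel sketch (no padding or stride, for brevity):

    #include <cstddef>
    #include <vector>

    // im2col: each output position becomes one column holding the k x k
    // input patch beneath it, laid out row-major as (k*k) x (outH*outW).
    std::vector<float> im2col(const std::vector<float>& in,
                              int H, int W, int k) {
        int outH = H - k + 1, outW = W - k + 1;
        std::vector<float> cols(static_cast<std::size_t>(k) * k * outH * outW);
        for (int y = 0; y < outH; ++y)
            for (int x = 0; x < outW; ++x)
                for (int dy = 0; dy < k; ++dy)
                    for (int dx = 0; dx < k; ++dx)
                        cols[(dy * k + dx) * outH * outW + y * outW + x] =
                            in[(y + dy) * W + (x + dx)];
        return cols;
    }

    // The convolution is now one GEMM: a (1 x k*k) filter row times the
    // (k*k x n) column matrix -- exactly what a GEMM accelerator executes.
    std::vector<float> convAsGemm(const std::vector<float>& filter,
                                  const std::vector<float>& cols,
                                  int kk, int n) {
        std::vector<float> out(n, 0.0f);
        for (int i = 0; i < kk; ++i)
            for (int j = 0; j < n; ++j)
                out[j] += filter[i] * cols[i * n + j];
        return out;
    }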

General Reuse-Centric CNN Accelerator

Reuse-centric CNN acceleration speeds up CNN inference by reusing computations across similar neuron vectors in the CNN’s input layer or activation maps. This new paradigm of optimization is, however, largely limited by the overhead of neuron vector similarity detection, a key step in reuse-centric CNN inference. This work presents the first in-depth exploration of architectural support for reuse-centric CNNs. It proposes a hardware accelerator that improves neuron vector similarity detection and reduces the energy consumption of reuse-centric CNN inference. The accelerator is implemented with a banked memory subsystem to support a wide variety of network settings.
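
A small software sketch of the similarity-detection step the accelerator targets (the quantization scheme is illustrative, not the accelerator's actual mechanism): each neuron vector is reduced to a short signature, and vectors that collide are treated as similar, so one computed result serves the whole group:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Coarsely quantize a neuron vector into a 64-bit signature (FNV-1a
    // over bucketed values); similar vectors tend to share a signature.
    uint64_t signature(const std::vector<float>& v, float step = 0.25f) {
        uint64_t sig = 1469598103934665603ull;           // FNV offset basis
        for (float x : v) {
            uint64_t bucket = static_cast<uint64_t>(
                static_cast<int64_t>(x / step)) & 0xFFu; // coarse bucket
            sig = (sig ^ bucket) * 1099511628211ull;     // FNV prime
        }
        return sig;
    }

    // Reuse cache keyed by signature: a hit skips the computation.
    float processWithReuse(const std::vector<float>& neurons,
                           float (*compute)(const std::vector<float>&),
                           std::unordered_map<uint64_t, float>& cache) {
        uint64_t sig = signature(neurons);
        auto it = cache.find(sig);
        if (it != cache.end()) return it->second;  // similarity hit: reuse
        float result = compute(neurons);           // miss: compute and cache
        cache.emplace(sig, result);
        return result;
    }

The accelerator's contribution is making this detection step cheap enough that the saved computations dominate the overhead.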

Processor Design for Graph Applications

The aim of this project is to design a processor architecture that processes large, irregular graph datasets quickly, efficiently, and in an easily programmable fashion. Most prior work on graph applications operates only at the software or accelerator level, and existing processor-level designs differ from the processor introduced in this project in terms of hardware cost, required architectural support, and instruction set modifications.

Developing Algorithms on Xeon + FPGA Platforms

This project uses combined accelerator + CPU technologies for graph-parallel applications on the Intel HARP platform. Users implement accelerator hardware on the FPGA and execute applications on the FPGA and CPU simultaneously. The proposed architecture addresses the limitations of existing multi-core CPU and GPU architectures. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions that execute on the FPGA and CPU at the same time.
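
A rough host-side sketch of the simultaneous execution model (the per-range functions are stand-ins for the real FPGA-enqueue and CPU paths; none of this is HARP's actual API): each iteration's vertex range is split so both devices work concurrently:

    #include <thread>
    #include <vector>

    std::vector<float> value;  // per-vertex state; sized to numVertices beforehand

    // Placeholders: in the real system the first range is handed to the
    // FPGA accelerator and the second is processed by the Xeon cores.
    void fpgaProcessRange(int begin, int end) {
        for (int v = begin; v < end; ++v) value[v] += 1.0f;  // stand-in work
    }
    void cpuProcessRange(int begin, int end) {
        for (int v = begin; v < end; ++v) value[v] += 1.0f;  // stand-in work
    }

    // Split one iteration so the FPGA and CPU halves overlap in time.
    void processIteration(int numVertices, double fpgaShare) {
        int split = static_cast<int>(numVertices * fpgaShare);
        std::thread fpga(fpgaProcessRange, 0, split);
        cpuProcessRange(split, numVertices);  // CPU works while FPGA runs
        fpga.join();                          // iteration barrier
    }

Tuning fpgaShare to balance the two halves is itself a mapping decision the runtime must make.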

Source-to-Source Transformation Based Methodology for Accelerators

Using a compiler or a code analysis tool, a program written in a given language can be optimized within that language or partly/completely translated into a different target language. The improvements can go beyond optimization and efficiency, for example enhancing the user experience. Although hardware accelerators improve efficiency by orders of magnitude, implementing for architectures such as GPUs, FPGAs, and ASICs is much harder than writing in a high-level language such as C++. Therefore, a developer who cannot use accelerators, due to hardware cost or implementation difficulty, has to accept the performance of CPU execution. A tool that accepts simple C++ code and executes it on an FPGA would provide the best of both worlds and allow execution on larger graphs.
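
A simple before/after illustration of such a transformation (the pragmas shown are typical of HLS tools such as Vivado HLS; the actual tool output may differ): the developer writes plain C++, and the source-to-source tool emits an annotated version that an HLS back end can synthesize into a pipelined circuit:

    // Before: plain C++ the developer writes.
    void scale(const float* in, float* out, int n, float a) {
        for (int i = 0; i < n; ++i)
            out[i] = a * in[i];
    }

    // After: a possible tool output for an HLS back end, with directives
    // so the synthesized loop produces one result per clock cycle.
    void scale_hls(const float in[1024], float out[1024], float a) {
    #pragma HLS INTERFACE m_axi port=in
    #pragma HLS INTERFACE m_axi port=out
        for (int i = 0; i < 1024; ++i) {
    #pragma HLS PIPELINE II=1
            out[i] = a * in[i];
        }
    }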

Hardware Acceleration for Graph Analytics Applications

Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this research, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of existing multi-core CPU and GPU architectures for these types of applications. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions. A cycle-accurate simulator and RTL can then be generated to model the target hardware accelerators. In our experiments, we study several graph-parallel applications and show that the hardware accelerators generated by our template can outperform a 24-core high-end server CPU system by up to 3x in terms of performance. We also estimate the area requirement and power consumption of these hardware accelerators through physical-aware logic synthesis, and show up to 65x better power consumption with a significantly smaller area.
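
The asymmetric-convergence property the template exploits can be sketched in software as a worklist loop (an illustrative simplification, not the generated RTL; a complete version would also reactivate the neighbors of changed vertices): only vertices whose value is still changing get scheduled, so the active set shrinks as most of the graph converges:

    #include <cmath>
    #include <vector>

    // Illustrative worklist iteration: converged vertices drop out, so
    // later iterations touch only the still-active part of the graph.
    void iterateUntilConvergence(int numVertices,
                                 std::vector<float>& value,
                                 float (*update)(int, const std::vector<float>&),
                                 float eps) {
        std::vector<int> active(numVertices);
        for (int v = 0; v < numVertices; ++v) active[v] = v;

        while (!active.empty()) {
            std::vector<int> next;
            for (int v : active) {
                float nv = update(v, value);        // vertex-centric update
                if (std::fabs(nv - value[v]) > eps) // still changing: keep
                    next.push_back(v);
                value[v] = nv;
            }
            active.swap(next);
        }
    }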

Cloud Computing

Manycore accelerators are being deployed in the Cloud to improve processing capabilities and to provide heterogeneity as Cloud computing systems become increasingly complex and process ever-growing datasets. In such systems, application scheduling and data mapping need to be enhanced to maximize utilization of the underlying architecture. Cloud management schemes therefore require a fresh look when the underlying components are heterogeneous in many different ways. Moreover, applications differ in how they perform on various specialized accelerators. Our goal is to design a runtime management system for Cloud systems with manycore accelerators.
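
A toy sketch of the placement decision such a runtime must make (the affinity table and node structure are invented for illustration): given measured performance of each application on each accelerator type, greedily place the application on the free node where it benefits most:

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Invented example: measured throughput of one application on each
    // accelerator type (higher is better).
    using Affinity = std::unordered_map<std::string, double>;

    struct Node { std::string accelType; int freeSlots; };

    // Greedy mapping: pick the free node whose accelerator type gives the
    // application the highest measured throughput.
    int pickNode(const Affinity& appAffinity, std::vector<Node>& nodes) {
        int best = -1;
        double bestScore = -1.0;
        for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
            if (nodes[i].freeSlots == 0) continue;
            auto it = appAffinity.find(nodes[i].accelType);
            double score = (it == appAffinity.end()) ? 0.0 : it->second;
            if (score > bestScore) { bestScore = score; best = i; }
        }
        if (best >= 0) --nodes[best].freeSlots;
        return best;  // -1 means no capacity is currently available
    }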

GPU Computing - Manycore Accelerators

Manycore accelerators are being deployed in many systems to improve processing capabilities. In such systems, application mapping needs to be enhanced to maximize utilization of the underlying architecture. In GPUs especially, mapping becomes critical for multi-kernel applications because kernels may exhibit different characteristics. While some kernels run faster on the GPU, others may be better off staying on the CPU due to high data transfer overhead. Thus, heterogeneous execution may yield better performance than executing the application only on the CPU or only on the GPU. We would like to design systems with smart kernel mapping.
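
The per-kernel decision reduces to a first-order cost model (illustrative; a real mapper must also consider transfer overlap and inter-kernel data reuse): offload a kernel only when the estimated GPU time plus host-device transfer time beats the CPU time:

    // Illustrative mapping rule for a single kernel.
    enum class Device { CPU, GPU };

    Device chooseDevice(double cpuTimeMs, double gpuTimeMs,
                        double bytesMoved, double pcieGBps) {
        // Convert GB/s to bytes per millisecond: 1 GB/s = 1e6 bytes/ms.
        double transferMs = bytesMoved / (pcieGBps * 1e6);
        return (gpuTimeMs + transferMs < cpuTimeMs) ? Device::GPU
                                                    : Device::CPU;
    }

If intermediate data stays on the GPU across consecutive kernels, the transfer term amortizes, which is why multi-kernel mapping should be decided globally rather than kernel by kernel.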

Automatic Parallelization

The importance of parallel programming has dramatically increased with the emergence of multicore and manycore architectures. We specifically focus on tools and techniques that enable the programmer to develop parallel programs easily. The task of identifying parallel code sections for different manycore architectures, including Intel MIC, can be carried out by the compiler. We aim to extend the compiler infrastructure to parallelize applications automatically.
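
As a concrete example of what such a compiler pass does (OpenMP is used here as the illustrative target; the actual infrastructure may emit a different threading back end): once dataflow analysis proves the loop iterations independent, the tool can insert the parallel annotation itself:

    // Before: the loop as written. Iteration i touches only a[i] and b[i]
    // (assuming a and b do not alias), so analysis finds no cross-iteration
    // dependences.
    void saxpy(float* a, const float* b, float s, int n) {
        for (int i = 0; i < n; ++i)
            a[i] = a[i] + s * b[i];
    }

    // After: what an auto-parallelizer can emit once independence is proven.
    void saxpy_parallel(float* a, const float* b, float s, int n) {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] = a[i] + s * b[i];
    }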

Reliability-Aware 3D Chip Multiprocessor Design

The ability to stack separate chips in a single package enables three-dimensional integrated circuits (3D ICs). Heterogeneous 3D ICs provide even better opportunities to reduce power and increase performance per unit area. An important issue in designing a heterogeneous 3D IC is reliability. To achieve it, one needs to select the data mapping and processor layout carefully. We address this problem using an integer linear programming (ILP) approach.
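
A skeletal version of such a formulation (the variables and objective are illustrative, not our exact model): let x_{c,l} = 1 if core c is assigned to layer l, and let p_c be core c's power. Balancing power across layers mitigates thermal hotspots, a prime reliability concern in stacked chips:

    \min P_{\max} \quad \text{s.t.} \quad
    \sum_{l} x_{c,l} = 1 \;\; \forall c, \qquad
    \sum_{c} p_c \, x_{c,l} \le P_{\max} \;\; \forall l, \qquad
    x_{c,l} \in \{0, 1\}

A full formulation would also capture the data mapping and within-layer placement described above, but the same 0-1 assignment structure carries over.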

Heterogeneous Chip Multiprocessors

The increasing complexity of applications and their large dataset sizes make it imperative to consider novel architectures that are efficient from both performance and power angles. Chip multiprocessors (CMPs), where multiple processor cores are placed on the same die, are one such example. As technology scales, the International Technology Roadmap for Semiconductors (ITRS) projects that the number of cores in a CMP will increase drastically to satisfy the performance requirements of future applications. A critical question in CMP design is the size and strength of the cores. Homogeneous chip multiprocessors provide only one type of core to match varied application requirements, and consequently do not fully utilize the available chip area and power budget. The ability to dynamically switch between different cores and to power down unused cores gives heterogeneous chip multiprocessing a key advantage. One of the challenging problems in the context of heterogeneous chip multiprocessor systems is the placement of processor cores and storage blocks within the available chip area. Focusing on such a heterogeneous chip multiprocessor, we address several design decision problems.

Dataflow Analysis

Memory is a key parameter in embedded systems, since both the code complexity of embedded applications and the amount of data they process are increasing. While the memory capacity of embedded systems keeps growing, the increases in application complexity and dataset sizes are far greater. Consequently, the memory space demand of code and data should be kept to a minimum. To reduce the memory space consumption of embedded systems, this work proposes a control flow graph (CFG) based technique. Specifically, it tracks the lifetime of instructions at the basic block level. Based on the CFG analysis, if a basic block is known to be inaccessible in the rest of the program execution, the instruction memory space allocated to this basic block is reclaimed. If the memory allocated to a basic block cannot be reclaimed, we instead try to compress it. This way, the available on-chip memory can be used effectively, satisfying most instruction/data requests from the on-chip memory.
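
A minimal sketch of the reachability test behind the reclamation decision (the graph representation is illustrative): starting from the currently executing basic block, walk the CFG forward; any block not reached can never execute again, so its instruction-memory space is safe to reclaim:

    #include <vector>

    // CFG as adjacency lists: succ[b] lists the successors of block b.
    // Returns, for each block, whether it may still execute once control
    // is at block `current`; unreachable blocks can be reclaimed.
    std::vector<bool> stillReachable(const std::vector<std::vector<int>>& succ,
                                     int current) {
        std::vector<bool> reach(succ.size(), false);
        std::vector<int> stack{current};
        while (!stack.empty()) {
            int b = stack.back();
            stack.pop_back();
            if (reach[b]) continue;
            reach[b] = true;
            for (int s : succ[b]) stack.push_back(s);
        }
        return reach;  // reach[b] == false => reclaim or compress block b
    }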

NoC Based Heterogeneous Systems

It is commonly accepted that the current performance trajectory of doubling chip performance every 24 to 36 months can be sustained by integrating multiple processors on a chip rather than by increasing the clock rate of single processors, owing to the power limitations of processor design. Multicore architectures have already made their way into industry, with more aggressive configurations being prototyped, such as Intel's 80-core TeraFLOPS research chip. Since future technologies promise the integration of billions of transistors on a chip, the prospect of having hundreds of processors on a single chip, along with an underlying memory hierarchy and an interconnection system, is entirely feasible. Point-to-point buses will no longer be feasible beyond a certain number of nodes, as communication requirements grow rapidly with the number of processors. A viable interconnection system shown to be promising for these future CMPs is the Network-on-Chip (NoC), since it provides scalable, flexible, and programmable communication. With an NoC-based chip multiprocessor (NoC-based CMP) as the computing platform, a very rich set of research challenges arises. Circuit and architectural challenges such as router design, IP placement, and sensor placement are currently being studied in both industry and academia. In comparison, work on heterogeneous alternatives for these architectures has received considerably less attention.

Memory Optimization

A critical component of a chip multiprocessor is its memory subsystem, because both the power and performance behavior of a chip multiprocessor are largely shaped by its on-chip memory. While it is possible to employ conventional memory designs such as pure private memory or pure shared memory, such designs are very general and rigid, and may not yield the best behavior for a given embedded application. Our belief is that, for embedded systems that repeatedly execute the same application, it makes sense to design a customized, software-managed on-chip memory architecture. Such a memory architecture should be a hybrid one that contains both private and shared components: some processors have private memories while others do not, and different processor groups can share memory in different fashions. For example, one memory component can be shared by two processors, whereas another can be shared by three processors.
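
To make the hybrid organization concrete (a made-up configuration, not one taken from a real design), each on-chip memory component can be described by its size and the set of processors allowed to access it, so private and variously shared components coexist in one specification:

    #include <vector>

    // One on-chip memory component and the processors that may access it.
    struct MemComponent {
        int sizeKB;
        std::vector<int> sharers;  // one entry = private; more = shared
    };

    // Made-up hybrid configuration for a 4-processor CMP: processor 0 has
    // a private bank, processors 1-2 share a component, and 1-3 another.
    const std::vector<MemComponent> hybridConfig = {
        {32,  {0}},        // private to processor 0
        {64,  {1, 2}},     // shared by two processors
        {128, {1, 2, 3}},  // shared by three processors
    };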

Designing such a customized hybrid memory architecture is nontrivial for at least three main reasons. First, since the memory architecture changes from one application to another, an ad hoc manual approach is unsuitable: it would be extremely time-consuming and error-prone to go through the same complex process every time we design a memory system for a new application. We therefore need an automated strategy that derives the most suitable design for a given application. Second, the design of such a memory needs to be guided by a tool that can extract the data sharing exhibited by the application at hand; to decide how different memory components should be shared by parallel processors, one must capture the data sharing pattern across the processors. Third, data allocation in a hybrid memory system is not a trivial problem and should be carried out together with data partitioning if we are to obtain the best results.