Parallel computation of an array sum with Swift and Metal on iOS

Based on @Kametrixom's answer, I made a test app that computes the sum of an array in parallel.

My test app looks like this:

    import UIKit
    import Metal

    class ViewController: UIViewController {

        // Data type, has to be the same as in the shader
        typealias DataType = CInt

        override func viewDidLoad() {
            super.viewDidLoad()

            let data = (0..<10000000).map{ _ in DataType(200) } // Our data, randomly generated
            var start, end : UInt64
            var result : DataType = 0

            start = mach_absolute_time()
            data.withUnsafeBufferPointer { buffer in
                for elem in buffer {
                    result += elem
                }
            }
            end = mach_absolute_time()

            print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

            result = 0

            start = mach_absolute_time()
            result = sumParallel4(data)
            end = mach_absolute_time()

            print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

            result = 0

            start = mach_absolute_time()
            result = sumParralel(data)
            end = mach_absolute_time()

            print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

            result = 0

            start = mach_absolute_time()
            result = sumParallel3(data)
            end = mach_absolute_time()

            print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
        }

        func sumParralel(data : Array<DataType>) -> DataType {
            let count = data.count
            let elementsPerSum: Int = Int(sqrt(Double(count)))

            let device = MTLCreateSystemDefaultDevice()!
            let parsum = device.newDefaultLibrary()!.newFunctionWithName("parsum")!
            let pipeline = try! device.newComputePipelineStateWithFunction(parsum)

            var dataCount = CUnsignedInt(count)
            var elementsPerSumC = CUnsignedInt(elementsPerSum)
            let resultsCount = (count + elementsPerSum - 1) / elementsPerSum // Number of individual results = count / elementsPerSum (rounded up)

            let dataBuffer = device.newBufferWithBytes(data, length: strideof(DataType) * count, options: []) // Our data in a buffer (copied)
            let resultsBuffer = device.newBufferWithLength(strideof(DataType) * resultsCount, options: []) // A buffer for individual results (zero initialized)
            let results = UnsafeBufferPointer<DataType>(start: UnsafePointer(resultsBuffer.contents()), count: resultsCount) // Our results in convenient form to compute the actual result later

            let queue = device.newCommandQueue()
            let cmds = queue.commandBuffer()
            let encoder = cmds.computeCommandEncoder()

            encoder.setComputePipelineState(pipeline)
            encoder.setBuffer(dataBuffer, offset: 0, atIndex: 0)
            encoder.setBytes(&dataCount, length: sizeofValue(dataCount), atIndex: 1)
            encoder.setBuffer(resultsBuffer, offset: 0, atIndex: 2)
            encoder.setBytes(&elementsPerSumC, length: sizeofValue(elementsPerSumC), atIndex: 3)

            // We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
            let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
            // Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
            let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

            encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
            encoder.endEncoding()

            var result : DataType = 0

            cmds.commit()
            cmds.waitUntilCompleted()

            for elem in results {
                result += elem
            }

            return result
        }

        func sumParralel1(data : Array<DataType>) -> UnsafeBufferPointer<DataType> {
            let count = data.count
            let elementsPerSum: Int = Int(sqrt(Double(count)))

            let device = MTLCreateSystemDefaultDevice()!
            let parsum = device.newDefaultLibrary()!.newFunctionWithName("parsum")!
            let pipeline = try! device.newComputePipelineStateWithFunction(parsum)

            var dataCount = CUnsignedInt(count)
            var elementsPerSumC = CUnsignedInt(elementsPerSum)
            let resultsCount = (count + elementsPerSum - 1) / elementsPerSum // Number of individual results = count / elementsPerSum (rounded up)

            let dataBuffer = device.newBufferWithBytes(data, length: strideof(DataType) * count, options: []) // Our data in a buffer (copied)
            let resultsBuffer = device.newBufferWithLength(strideof(DataType) * resultsCount, options: []) // A buffer for individual results (zero initialized)
            let results = UnsafeBufferPointer<DataType>(start: UnsafePointer(resultsBuffer.contents()), count: resultsCount) // Our results in convenient form to compute the actual result later

            let queue = device.newCommandQueue()
            let cmds = queue.commandBuffer()
            let encoder = cmds.computeCommandEncoder()

            encoder.setComputePipelineState(pipeline)
            encoder.setBuffer(dataBuffer, offset: 0, atIndex: 0)
            encoder.setBytes(&dataCount, length: sizeofValue(dataCount), atIndex: 1)
            encoder.setBuffer(resultsBuffer, offset: 0, atIndex: 2)
            encoder.setBytes(&elementsPerSumC, length: sizeofValue(elementsPerSumC), atIndex: 3)

            // We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
            let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
            // Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
            let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

            encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
            encoder.endEncoding()

            cmds.commit()
            cmds.waitUntilCompleted()

            return results
        }

        func sumParallel3(data : Array<DataType>) -> DataType {
            var results = sumParralel1(data)
            repeat {
                results = sumParralel1(Array(results))
            } while results.count >= 100

            var result : DataType = 0
            for elem in results {
                result += elem
            }
            return result
        }

        func sumParallel4(data : Array<DataType>) -> DataType {
            let queue = NSOperationQueue()
            queue.maxConcurrentOperationCount = 4

            var a0 : DataType = 0
            var a1 : DataType = 0
            var a2 : DataType = 0
            var a3 : DataType = 0

            let op0 = NSBlockOperation(block : {
                for i in 0..<(data.count/4) {
                    a0 = a0 + data[i]
                }
            })
            let op1 = NSBlockOperation(block : {
                for i in (data.count/4)..<(data.count/2) {
                    a1 = a1 + data[i]
                }
            })
            let op2 = NSBlockOperation(block : {
                for i in (data.count/2)..<(3 * data.count/4) {
                    a2 = a2 + data[i]
                }
            })
            let op3 = NSBlockOperation(block : {
                for i in (3 * data.count/4)..<(data.count) {
                    a3 = a3 + data[i]
                }
            })

            queue.addOperation(op0)
            queue.addOperation(op1)
            queue.addOperation(op2)
            queue.addOperation(op3)

            queue.suspended = false
            queue.waitUntilAllOperationsAreFinished()

            let aaa: DataType = a0 + a1 + a2 + a3
            return aaa
        }
    }

And I have a shader that looks like this:

    kernel void parsum(const device DataType* data [[ buffer(0) ]],
                       const device uint& dataLength [[ buffer(1) ]],
                       device DataType* sums [[ buffer(2) ]],
                       const device uint& elementsPerSum [[ buffer(3) ]],
                       const uint tgPos [[ threadgroup_position_in_grid ]],
                       const uint tPerTg [[ threads_per_threadgroup ]],
                       const uint tPos [[ thread_position_in_threadgroup ]]) {

        uint resultIndex = tgPos * tPerTg + tPos; // This is the index of the individual result, this var is unique to this thread
        uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
        uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength; // The index where summation should end

        for (; dataIndex < endIndex; dataIndex++)
            sums[resultIndex] += data[dataIndex];
    }

To my surprise, the function sumParallel4 is the fastest, and I don't think it should be. I also noticed that when I call the functions sumParralel and sumParallel3, whichever runs first is always slower, even if I change the order in which they're called (so if I call sumParralel first, it's the slower one; if I call sumParallel3 first, then that one is slower).

Why is this? Why isn't sumParallel3 much faster than sumParralel? And why is sumParallel4 the fastest, even though it computes on the CPU?


How can I update my GPU function to use posix_memalign? I know it should run faster because it would use memory shared between the GPU and the CPU, but I don't know which array should be allocated this way (the data or the results), and how can I allocate the data with posix_memalign if the data is a parameter passed to the function?

Running these tests on an iPhone 6, I saw the Metal version run between 3x slower and 2x slower than the naive CPU sum. With the modifications I describe below, it was consistently faster.

I found that a lot of the cost of running the Metal version could be attributed not merely to the allocation of the buffers, though that was significant, but also to the first-time creation of the device and the compute pipeline state. These are actions you'd normally perform once at application initialization, so it's not entirely fair to include them in the timing.
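As a sketch of that point (using the same Swift 2-era API as the code above; where exactly you cache these objects is up to you), the one-time setup can be hoisted out of the timed region so that only buffer creation, encoding, and dispatch are measured:

```swift
// Sketch: perform one-time Metal setup once (e.g. in viewDidLoad)
// and reuse the results for every timed run.
let device = MTLCreateSystemDefaultDevice()!
let parsum = device.newDefaultLibrary()!.newFunctionWithName("parsum")!
let pipeline = try! device.newComputePipelineStateWithFunction(parsum)
let queue = device.newCommandQueue()

// Later, time only the per-run work, reusing `pipeline` and `queue`:
let start = mach_absolute_time()
// ... create buffers, encode, dispatch, waitUntilCompleted ...
let end = mach_absolute_time()
```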

It should also be noted that if you're running these tests through Xcode with the Metal validation layer and GPU frame capture enabled, that exacts a significant run-time cost and will skew the results in the CPU's favor.

With those caveats out of the way, here's how you might use posix_memalign to allocate memory that can be used to back an MTLBuffer. The trick is to ensure that the memory you request is actually page-aligned (i.e. its address is a multiple of getpagesize()), which may entail rounding up the amount of memory beyond how much you actually need to store your data:

    let dataCount = 1_000_000
    let dataSize = dataCount * strideof(DataType)
    let pageSize = Int(getpagesize())
    let pageCount = (dataSize + (pageSize - 1)) / pageSize
    var dataPointer: UnsafeMutablePointer<Void> = nil
    posix_memalign(&dataPointer, pageSize, pageCount * pageSize)
    let data = UnsafeMutableBufferPointer(start: UnsafeMutablePointer<DataType>(dataPointer),
                                          count: (pageCount * pageSize) / strideof(DataType))

    for i in 0..<dataCount {
        data[i] = 200
    }

This requires making data an UnsafeMutableBufferPointer<DataType> rather than a [DataType], since Swift's Array allocates its own backing store. You'll also need to pass along the count of data items to operate on, since the count of the mutable buffer pointer has been rounded up to make the buffer page-aligned.
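Concretely (a sketch, reusing `DataType`, `data`, and `dataCount` from the snippet above), any loop over the buffer should bound itself by the original element count rather than the rounded-up one:

```swift
// Sketch: sum only the first `dataCount` elements. `data.count` has
// been rounded up to fill whole pages, so the trailing elements are
// padding and must not be included in the result.
var result: DataType = 0
for i in 0..<dataCount {
    result += data[i]
}
```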

To actually create an MTLBuffer backed by this data, use the newBufferWithBytesNoCopy(_:length:options:deallocator:) API. Once again, it's essential that the length you provide is a multiple of the page size; otherwise this method returns nil:

    let roundedUpDataSize = strideof(DataType) * data.count
    let dataBuffer = device.newBufferWithBytesNoCopy(data.baseAddress, length: roundedUpDataSize, options: [], deallocator: nil)

Here, we don't provide a deallocator, but you should free the memory when you're done using it, by passing the baseAddress of the buffer pointer to free().
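Alternatively (a sketch, assuming the Swift 2-era deallocator closure signature for this API), you could let Metal release the memory for you by passing a deallocator instead of nil:

```swift
// Sketch: free the page-aligned backing memory when the buffer is
// deallocated, instead of calling free() manually later.
let dataBuffer = device.newBufferWithBytesNoCopy(data.baseAddress,
    length: roundedUpDataSize,
    options: [],
    deallocator: { pointer, _ in free(pointer) })
```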