Parallel computation of an array sum with Swift and Metal on iOS

Based on @Kametrixom's answer, I made a test app that computes the sum of an array in parallel.

My test app looks like this:

    import UIKit
    import Metal

    class ViewController: UIViewController {

        // Data type, has to be the same as in the shader
        typealias DataType = CInt

        override func viewDidLoad() {
            super.viewDidLoad()

            let data = (0..<10000000).map{ _ in DataType(200) } // Our data, randomly generated
            var start, end : UInt64
            var result : DataType = 0

            start = mach_absolute_time()
            data.withUnsafeBufferPointer { buffer in
                for elem in buffer {
                    result += elem
                }
            }
            end = mach_absolute_time()

            print("CPU result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

            result = 0

            start = mach_absolute_time()
            result = sumParallel4(data)
            end = mach_absolute_time()

            print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

            result = 0

            start = mach_absolute_time()
            result = sumParralel(data)
            end = mach_absolute_time()

            print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")

            result = 0

            start = mach_absolute_time()
            result = sumParallel3(data)
            end = mach_absolute_time()

            print("Metal result: \(result), time: \(Double(end - start) / Double(NSEC_PER_SEC))")
        }

        func sumParralel(data : Array<DataType>) -> DataType {
            let count = data.count
            let elementsPerSum: Int = Int(sqrt(Double(count)))

            let device = MTLCreateSystemDefaultDevice()!
            let parsum = device.newDefaultLibrary()!.newFunctionWithName("parsum")!
            let pipeline = try! device.newComputePipelineStateWithFunction(parsum)

            var dataCount = CUnsignedInt(count)
            var elementsPerSumC = CUnsignedInt(elementsPerSum)
            let resultsCount = (count + elementsPerSum - 1) / elementsPerSum // Number of individual results = count / elementsPerSum (rounded up)

            let dataBuffer = device.newBufferWithBytes(data, length: strideof(DataType) * count, options: []) // Our data in a buffer (copied)
            let resultsBuffer = device.newBufferWithLength(strideof(DataType) * resultsCount, options: []) // A buffer for individual results (zero initialized)
            let results = UnsafeBufferPointer<DataType>(start: UnsafePointer(resultsBuffer.contents()), count: resultsCount) // Our results in convenient form to compute the actual result later

            let queue = device.newCommandQueue()
            let cmds = queue.commandBuffer()
            let encoder = cmds.computeCommandEncoder()

            encoder.setComputePipelineState(pipeline)
            encoder.setBuffer(dataBuffer, offset: 0, atIndex: 0)
            encoder.setBytes(&dataCount, length: sizeofValue(dataCount), atIndex: 1)
            encoder.setBuffer(resultsBuffer, offset: 0, atIndex: 2)
            encoder.setBytes(&elementsPerSumC, length: sizeofValue(elementsPerSumC), atIndex: 3)

            // We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
            let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
            // Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
            let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

            encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
            encoder.endEncoding()

            var result : DataType = 0

            cmds.commit()
            cmds.waitUntilCompleted()

            for elem in results {
                result += elem
            }

            return result
        }

        func sumParralel1(data : Array<DataType>) -> UnsafeBufferPointer<DataType> {
            let count = data.count
            let elementsPerSum: Int = Int(sqrt(Double(count)))

            let device = MTLCreateSystemDefaultDevice()!
            let parsum = device.newDefaultLibrary()!.newFunctionWithName("parsum")!
            let pipeline = try! device.newComputePipelineStateWithFunction(parsum)

            var dataCount = CUnsignedInt(count)
            var elementsPerSumC = CUnsignedInt(elementsPerSum)
            let resultsCount = (count + elementsPerSum - 1) / elementsPerSum // Number of individual results = count / elementsPerSum (rounded up)

            let dataBuffer = device.newBufferWithBytes(data, length: strideof(DataType) * count, options: []) // Our data in a buffer (copied)
            let resultsBuffer = device.newBufferWithLength(strideof(DataType) * resultsCount, options: []) // A buffer for individual results (zero initialized)
            let results = UnsafeBufferPointer<DataType>(start: UnsafePointer(resultsBuffer.contents()), count: resultsCount) // Our results in convenient form to compute the actual result later

            let queue = device.newCommandQueue()
            let cmds = queue.commandBuffer()
            let encoder = cmds.computeCommandEncoder()

            encoder.setComputePipelineState(pipeline)
            encoder.setBuffer(dataBuffer, offset: 0, atIndex: 0)
            encoder.setBytes(&dataCount, length: sizeofValue(dataCount), atIndex: 1)
            encoder.setBuffer(resultsBuffer, offset: 0, atIndex: 2)
            encoder.setBytes(&elementsPerSumC, length: sizeofValue(elementsPerSumC), atIndex: 3)

            // We have to calculate the sum `resultCount` times => amount of threadgroups is `resultsCount` / `threadExecutionWidth` (rounded up) because each threadgroup will process `threadExecutionWidth` threads
            let threadgroupsPerGrid = MTLSize(width: (resultsCount + pipeline.threadExecutionWidth - 1) / pipeline.threadExecutionWidth, height: 1, depth: 1)
            // Here we set that each threadgroup should process `threadExecutionWidth` threads, the only important thing for performance is that this number is a multiple of `threadExecutionWidth` (here 1 times)
            let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

            encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
            encoder.endEncoding()

            cmds.commit()
            cmds.waitUntilCompleted()

            return results
        }

        func sumParallel3(data : Array<DataType>) -> DataType {
            var results = sumParralel1(data)
            repeat {
                results = sumParralel1(Array(results))
            } while results.count >= 100

            var result : DataType = 0
            for elem in results {
                result += elem
            }
            return result
        }

        func sumParallel4(data : Array<DataType>) -> DataType {
            let queue = NSOperationQueue()
            queue.maxConcurrentOperationCount = 4

            var a0 : DataType = 0
            var a1 : DataType = 0
            var a2 : DataType = 0
            var a3 : DataType = 0

            let op0 = NSBlockOperation(block : {
                for i in 0..<(data.count/4) {
                    a0 = a0 + data[i]
                }
            })
            let op1 = NSBlockOperation(block : {
                for i in (data.count/4)..<(data.count/2) {
                    a1 = a1 + data[i]
                }
            })
            let op2 = NSBlockOperation(block : {
                for i in (data.count/2)..<(3 * data.count/4) {
                    a2 = a2 + data[i]
                }
            })
            let op3 = NSBlockOperation(block : {
                for i in (3 * data.count/4)..<(data.count) {
                    a3 = a3 + data[i]
                }
            })

            queue.addOperation(op0)
            queue.addOperation(op1)
            queue.addOperation(op2)
            queue.addOperation(op3)

            queue.suspended = false
            queue.waitUntilAllOperationsAreFinished()

            let aaa: DataType = a0 + a1 + a2 + a3
            return aaa
        }
    }

And I have a shader that looks like this:

    kernel void parsum(const device DataType* data [[ buffer(0) ]],
                       const device uint& dataLength [[ buffer(1) ]],
                       device DataType* sums [[ buffer(2) ]],
                       const device uint& elementsPerSum [[ buffer(3) ]],
                       const uint tgPos [[ threadgroup_position_in_grid ]],
                       const uint tPerTg [[ threads_per_threadgroup ]],
                       const uint tPos [[ thread_position_in_threadgroup ]]) {

        uint resultIndex = tgPos * tPerTg + tPos; // This is the index of the individual result, this var is unique to this thread
        uint dataIndex = resultIndex * elementsPerSum; // Where the summation should begin
        uint endIndex = dataIndex + elementsPerSum < dataLength ? dataIndex + elementsPerSum : dataLength; // The index where summation should end

        for (; dataIndex < endIndex; dataIndex++)
            sums[resultIndex] += data[dataIndex];
    }

To my surprise, the function sumParallel4 is the fastest, and I don't think it should be. I also noticed that when I call the functions sumParralel and sumParallel3, whichever runs first is always slower, even if I change the order in which they're called (so if I call sumParralel first, it's the slower one; if I call sumParallel3 first, then that one is slower).

Why is this? Why isn't sumParallel3 much faster than sumParralel? And why is sumParallel4 the fastest, even though it computes on the CPU?


How can I update my GPU function to use posix_memalign? I know it should run faster because it would use memory shared between the GPU and the CPU, but I don't know which array should be allocated this way (the data or the results), and how can I allocate the data with posix_memalign if the data is a parameter passed to the function?

Running these tests on an iPhone 6, I saw the Metal version run between 3x slower and 2x slower than the naive CPU sum. With the modifications I describe below, it was consistently faster.

I found that a lot of the cost of running the Metal version could be attributed not merely to the allocation of the buffers, though that was significant, but also to the first-time creation of the device and the compute pipeline state. These are actions you'd normally perform once at application initialization, so it's not entirely fair to include them in the timing.
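As a sketch of that point (using the same Swift 2-era API as the code above; where exactly you cache these objects is up to you), the one-time setup can be hoisted out of the timed region so that only buffer creation, encoding, and dispatch are measured:

```swift
// Sketch: perform one-time Metal setup once (e.g. in viewDidLoad)
// and reuse the results for every timed run.
let device = MTLCreateSystemDefaultDevice()!
let parsum = device.newDefaultLibrary()!.newFunctionWithName("parsum")!
let pipeline = try! device.newComputePipelineStateWithFunction(parsum)
let queue = device.newCommandQueue()

// Later, time only the per-run work, reusing `pipeline` and `queue`:
let start = mach_absolute_time()
// ... create buffers, encode, dispatch, waitUntilCompleted ...
let end = mach_absolute_time()
```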

It should also be noted that if you're running these tests through Xcode with the Metal validation layer and GPU frame capture enabled, that exacts a significant run-time cost and will skew the results in the CPU's favor.

With those caveats out of the way, here's how you might use posix_memalign to allocate memory that can be used to back an MTLBuffer. The trick is to ensure that the memory you request is actually page-aligned (i.e. its address is a multiple of getpagesize()), which may entail rounding up the amount of memory beyond how much you actually need to store your data:

    let dataCount = 1_000_000
    let dataSize = dataCount * strideof(DataType)
    let pageSize = Int(getpagesize())
    let pageCount = (dataSize + (pageSize - 1)) / pageSize
    var dataPointer: UnsafeMutablePointer<Void> = nil
    posix_memalign(&dataPointer, pageSize, pageCount * pageSize)
    let data = UnsafeMutableBufferPointer(start: UnsafeMutablePointer<DataType>(dataPointer),
                                          count: (pageCount * pageSize) / strideof(DataType))

    for i in 0..<dataCount {
        data[i] = 200
    }

This requires making data an UnsafeMutableBufferPointer<DataType> rather than a [DataType], since Swift's Array allocates its own backing store. You'll also need to pass along the count of data items to operate on, since the count of the mutable buffer pointer has been rounded up to make the buffer page-aligned.
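Concretely (a sketch, reusing `DataType`, `data`, and `dataCount` from the snippet above), any loop over the buffer should bound itself by the original element count rather than the rounded-up one:

```swift
// Sketch: sum only the first `dataCount` elements. `data.count` has
// been rounded up to fill whole pages, so the trailing elements are
// padding and must not be included in the result.
var result: DataType = 0
for i in 0..<dataCount {
    result += data[i]
}
```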

To actually create an MTLBuffer backed by this data, use the newBufferWithBytesNoCopy(_:length:options:deallocator:) API. Once again, it's essential that the length you provide is a multiple of the page size; otherwise this method returns nil:

    let roundedUpDataSize = strideof(DataType) * data.count
    let dataBuffer = device.newBufferWithBytesNoCopy(data.baseAddress, length: roundedUpDataSize, options: [], deallocator: nil)

Here, we don't provide a deallocator, but you should free the memory when you're done using it, by passing the baseAddress of the buffer pointer to free().
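Alternatively (a sketch, assuming the Swift 2-era deallocator closure signature for this API), you could let Metal release the memory for you by passing a deallocator instead of nil:

```swift
// Sketch: free the page-aligned backing memory when the buffer is
// deallocated, instead of calling free() manually later.
let dataBuffer = device.newBufferWithBytesNoCopy(data.baseAddress,
    length: roundedUpDataSize,
    options: [],
    deallocator: { pointer, _ in free(pointer) })
```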