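Converting a multi-byte Unicode byte array to an NSString using a partial buffer

In Objective-C, is there a way to convert a multi-byte Unicode byte array to an NSString that lets the conversion succeed even when the array data is a partial buffer (one that does not end on a complete character boundary)?

The use case is receiving byte buffers from a stream and wanting to parse the string version of the data buffer, while more data is still to come and the buffer does not end on a complete multi-byte Unicode character.

NSString's initWithData:encoding: method does not work for this purpose, as shown here…

Test code: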

    - (void)test {
        char myArray[] = {'f', 'o', 'o', (char) 0xc3, (char) 0x97, 'b', 'a', 'r'};
        size_t sizeOfMyArray = sizeof(myArray);
        [self dump:myArray sizeOfMyArray:sizeOfMyArray];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 1];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 2];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 3];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 4];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 5];
    }

    - (void)dump:(char[])myArray sizeOfMyArray:(size_t)sourceLength {
        NSString *string = [[NSString alloc] initWithData:[NSData dataWithBytes:myArray length:sourceLength]
                                                 encoding:NSUTF8StringEncoding];
        // %zu matches size_t; NSUInteger is cast to unsigned long for %lu.
        // Note: string.length counts UTF-16 code units, not bytes; the label is kept as in the original log.
        NSLog(@"sourceLength: %zu bytes, string.length: %lu bytes, string :'%@'",
              sourceLength, (unsigned long)string.length, string);
    }

Output:

    sourceLength: 8 bytes, string.length: 7 bytes, string :'foo×bar'
    sourceLength: 7 bytes, string.length: 6 bytes, string :'foo×ba'
    sourceLength: 6 bytes, string.length: 5 bytes, string :'foo×b'
    sourceLength: 5 bytes, string.length: 4 bytes, string :'foo×'
    sourceLength: 4 bytes, string.length: 0 bytes, string :'(null)'
    sourceLength: 3 bytes, string.length: 3 bytes, string :'foo'

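As can be seen, converting the "sourceLength: 4 bytes" byte array fails and returns (null). This is because the UTF-8 encoding of the "×" character (0xc3 0x97) is only partially included.

Ideally there would be a function I could use that returns the correct NSString and tells me how many bytes are "left over".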

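You basically have your own answer. If the initWithData:encoding: method returns nil, then you know the buffer has a partial (invalid) character at the end.

Modify dump to return an int. Then try to create the NSString in a loop. Each time you get nil, reduce the length and try again. Once you get a valid NSString, return the difference between the length passed in and the length actually used.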
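The retry loop described above could be sketched roughly as follows. It is written in plain C (which also compiles as Objective-C) so that it stands alone; utf8_is_well_formed is a hypothetical, simplified stand-in for the nil check on initWithData:encoding: and does not reject overlong or surrogate encodings. The function names and the limit of three retries are assumptions, not part of the answer: the limit relies on a truncated UTF-8 sequence being at most three bytes long, since a complete sequence is at most four.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a strict decoder: checks only the lead-byte /
   continuation-byte structure of the buffer (simplified on purpose). */
static bool utf8_is_well_formed(const unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t need;
        if (c <= 0x7f) need = 1;
        else if (c >= 0xc2 && c <= 0xdf) need = 2;
        else if (c >= 0xe0 && c <= 0xef) need = 3;
        else if (c >= 0xf0 && c <= 0xf4) need = 4;
        else return false;                    /* stray continuation or invalid lead byte */
        if (need > len - i) return false;     /* sequence runs past the end of the buffer */
        for (size_t k = 1; k < need; k++) {
            if (buf[i + k] < 0x80 || buf[i + k] > 0xbf) return false;
        }
        i += need;
    }
    return true;
}

/* Try the full length first, then up to three shorter lengths.
   Returns the number of trailing bytes left over (0 to 3),
   or -1 if none of those attempts decodes. */
int utf8_leftover_by_retry(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    for (size_t drop = 0; drop <= 3 && drop <= len; drop++) {
        if (utf8_is_well_formed(buf, len - drop)) {
            return (int)drop;
        }
    }
    return -1;
}
```

In the Objective-C version, the call to utf8_is_well_formed would be replaced by the attempt to create the NSString. Giving up after three retries means a buffer that is invalid for some other reason fails quickly instead of being re-parsed at every possible length; this bound holds for UTF-8 only.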

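I ran into this problem before and had forgotten about it for a while, so this was an opportunity to solve it. The code below was written using the information on Wikipedia's UTF-8 page. It is a category on NSData.

It examines the data from the end, and only the last four bytes, because the OP said the data could be gigabytes in size. Otherwise it would be simpler to walk the UTF-8 bytes from the beginning.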

    /*
     Return the range of valid UTF-8 encoded text by removing a partial trailing multi-byte char.
     It assumes that all the bytes are valid UTF-8 encoded chars, e.g. it does not raise a flag
     if a continuation byte is preceded by a single-byte char.
     */
    - (NSRange)rangeOfUTF8WithoutPartialTrailingMultibytes {
        NSRange validRange = {0, 0};
        NSUInteger trailLength = MIN([self length], 4U);
        unsigned char trail[4];
        [self getBytes:trail range:NSMakeRange([self length] - trailLength, trailLength)];
        unsigned multibyteCount = 0;
        // Walk backwards over the last (at most four) bytes.
        for (NSInteger i = (NSInteger)trailLength - 1; i >= 0; i--) {
            if (isUTF8SingleByte(trail[i])) {
                // A single-byte char: everything up to and including it is kept.
                validRange = NSMakeRange(0, [self length] - trailLength + i + 1);
                break;
            }
            if (isUTF8ContinuationByte(trail[i])) {
                multibyteCount++;
                continue;
            }
            if (isUTF8StartByte(trail[i])) {
                multibyteCount++;
                if (multibyteCount == lengthForUTF8StartByte(trail[i])) {
                    // The trailing multi-byte char is complete.
                    validRange = NSMakeRange(0, [self length] - trailLength + i + multibyteCount);
                } else {
                    // The trailing multi-byte char is partial: cut just before its start byte.
                    validRange = NSMakeRange(0, [self length] - trailLength + i);
                }
                break;
            }
        }
        return validRange;
    }

Here are the static functions used in the method:

    static BOOL isUTF8SingleByte(const unsigned char c) {
        return c <= 0x7f;
    }

    static BOOL isUTF8ContinuationByte(const unsigned char c) {
        return (c >= 0x80) && (c <= 0xbf);
    }

    static BOOL isUTF8StartByte(const unsigned char c) {
        return (c >= 0xc2) && (c <= 0xf4);
    }

    static BOOL isUTF8InvalidByte(const unsigned char c) {
        return (c == 0xc0) || (c == 0xc1) || (c > 0xf4);
    }

    static unsigned lengthForUTF8StartByte(const unsigned char c) {
        if ((c >= 0xc2) && (c <= 0xdf)) {
            return 2;
        } else if ((c >= 0xe0) && (c <= 0xef)) {
            return 3;
        } else if ((c >= 0xf0) && (c <= 0xf4)) {
            return 4;
        }
        return 1;
    }
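To see how the tail check behaves on the byte array from the question, here is a minimal stand-alone sketch in plain C (which also compiles as Objective-C). It is not the category above: the names utf8_sequence_length and utf8_complete_prefix_length are hypothetical, and unlike the method above it returns the full length when the tail contains no recognisable start byte, leaving that case to the decoder. Like the method above, it assumes the rest of the buffer is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes a UTF-8 sequence should have, judged by its lead byte.
   Returns 0 for continuation bytes and for bytes that can never start a sequence. */
static size_t utf8_sequence_length(unsigned char c) {
    if (c <= 0x7f) return 1;
    if (c >= 0xc2 && c <= 0xdf) return 2;
    if (c >= 0xe0 && c <= 0xef) return 3;
    if (c >= 0xf0 && c <= 0xf4) return 4;
    return 0;
}

/* Length of the longest prefix of buf that does not end in a truncated
   multi-byte sequence. Only the last (at most four) bytes are inspected. */
size_t utf8_complete_prefix_length(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t max_back = len < 4 ? len : 4;
    for (size_t back = 1; back <= max_back; back++) {
        unsigned char c = buf[len - back];
        if (c >= 0x80 && c <= 0xbf) {
            continue;                 /* continuation byte: keep looking for the lead byte */
        }
        /* 'back' is the number of bytes available from this byte to the end. */
        if (utf8_sequence_length(c) > back) {
            return len - back;        /* truncated sequence: cut just before its lead byte */
        }
        return len;                   /* last sequence is complete (or not a lead byte at all) */
    }
    return len;                       /* empty buffer, or no lead byte within the tail */
}
```

For the array from the question, a length of 4 gives a prefix of 3 ("foo") with 1 byte left over, while lengths 8, 5 and 3 are returned unchanged. In a streaming setting, the left-over bytes (len minus the returned prefix length) would be kept and prepended to the next chunk before decoding it.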

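This is my inefficient implementation, and I don't consider it a proper answer. I'll leave it here in case others find it useful (and in the hope that someone will give a better answer than this!).

It lives in a category on NSMutableData…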

    /**
     * Removes the biggest string possible from this NSMutableData, leaving any remainder unicode half-characters behind.
     *
     * NOTE: This is a very inefficient implementation, it may require multiple parsing of the complete NSMutableData buffer,
     * it is especially inefficient when the data buffer does not contain a valid string encoding, as all lengths will be
     * attempted.
     */
    - (NSString *)removeMaximumStringUsingEncoding:(NSStringEncoding)encoding {
        if (self.length > 0) {
            // Quick test for the case where the whole buffer can be used
            // (is common case, and doesn't require NSData manipulation).
            NSString *result = [[NSString alloc] initWithData:self encoding:encoding];
            if (result != nil) {
                self.length = 0; // Simple case, we used the whole buffer.
                return result;
            }

            // Try to find the largest subData that is a valid string.
            for (NSUInteger subDataLength = self.length - 1; subDataLength > 0; subDataLength--) {
                NSRange subDataRange = NSMakeRange(0, subDataLength);
                result = [[NSString alloc] initWithData:[self subdataWithRange:subDataRange] encoding:encoding];
                if (result != nil) {
                    // Delete the bytes we used from our buffer, leave the remainder.
                    [self replaceBytesInRange:subDataRange withBytes:NULL length:0];
                    return result;
                }
            }
        }
        return @"";
    }
