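Optimizing string parsing

I need to parse data files in the "txf" format. These files may contain more than 1000 entries. Since the format is well defined, like JSON, I wanted to build a generic parser, similar to a JSON parser, that can serialize and deserialize txf files.

In contrast to JSON, the markup has no way to distinguish an object from an array. If several entries with the same tag occur, we need to treat them as an array.

  1. # marks the start of an object.
  2. $ marks a member of an object.
  3. / marks the end of an object.

The following is a sample "txf" file: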

    #Employees
    $LastUpdated=2015-02-01 14:01:00
    #Employee
    $Id=1
    $Name=Employee 01
    #Departments
    $LastUpdated=2015-02-01 14:01:00
    #Department
    $Id=1
    $Name=Department Name
    /Department
    /Departments
    /Employee
    #Employee
    /Employee
    /Employees
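
As a rough illustration of the three marker rules above, the following C++ sketch classifies a single line by its first character. The names `LineKind`, `TxfLine`, and `classifyLine` are hypothetical and are not part of the parser in this question. Treating every other first character as `Other` is an assumption of the sketch.

```cpp
#include <string>

// Hypothetical names for illustration only.
enum class LineKind { ObjectStart, Member, ObjectEnd, Other };

struct TxfLine {
    LineKind kind;
    std::string payload;  // text after the marker; the whole line for Other
};

// Classifies one txf line by its first character:
// '#' starts an object, '$' is a member, '/' ends an object.
TxfLine classifyLine(const std::string& line) {
    if (line.empty()) return TxfLine{LineKind::Other, ""};
    switch (line[0]) {
        case '#': return TxfLine{LineKind::ObjectStart, line.substr(1)};
        case '$': return TxfLine{LineKind::Member, line.substr(1)};
        case '/': return TxfLine{LineKind::ObjectEnd, line.substr(1)};
        default:  return TxfLine{LineKind::Other, line};  // assumption: unmarked text
    }
}
```

Since the dispatch only looks at one character, classifying a line needs no prefix comparison and no intermediate string apart from the payload.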

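I was able to create a generic TXF parser using NSScanner. But with more entries, the performance needs more tuning.

I wrote the resulting Foundation object out as a plist and compared the plist parser's performance with that of the parser I wrote. My parser is about 10 times slower than the plist parser.

Although the plist file is 5 times larger than the txf file and contains more markup characters, I feel there is a lot of room for optimization.

Any help in this direction is highly appreciated.

EDIT: Including the parsing code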

    static NSString *const kArray    = @"TXFArray";
    static NSString *const kBodyText = @"TXFText";

    @interface TXFParser ()

    /*Temporary variable to hold values of an object*/
    @property (nonatomic, strong) NSMutableDictionary *dict;

    /*An array to hold the hierarchical data of all nodes encountered while parsing*/
    @property (nonatomic, strong) NSMutableArray *stack;

    @end

    @implementation TXFParser

    #pragma mark - Getters

    - (NSMutableArray *)stack{
        if (!_stack) {
            _stack = [NSMutableArray new];
        }return _stack;
    }

    #pragma mark -

    - (id)objectFromString:(NSString *)txfString{
        [txfString enumerateLinesUsingBlock:^(NSString *string, BOOL *stop) {
            if ([string hasPrefix:@"#"]) {
                [self didStartParsingTag:[string substringFromIndex:1]];
            }else if([string hasPrefix:@"$"]){
                [self didFindKeyValuePair:[string substringFromIndex:1]];
            }else if([string hasPrefix:@"/"]){
                [self didEndParsingTag:[string substringFromIndex:1]];
            }else{
                //[self didFindBodyValue:string];
            }
        }];
        return self.dict;
    }

    #pragma mark -

    - (void)didStartParsingTag:(NSString *)tag{
        [self parserFoundObjectStartForKey:tag];
    }

    - (void)didFindKeyValuePair:(NSString *)tag{
        NSArray *components = [tag componentsSeparatedByString:@"="];
        NSString *key = [components firstObject];
        NSString *value = [components lastObject];

        if (key.length) {
            self.dict[key] = value?:@"";
        }
    }

    - (void)didFindBodyValue:(NSString *)bodyString{
        if (!bodyString.length) return;
        bodyString = [bodyString stringByTrimmingCharactersInSet:[NSCharacterSet illegalCharacterSet]];
        if (!bodyString.length) return;

        self.dict[kBodyText] = bodyString;
    }

    - (void)didEndParsingTag:(NSString *)tag{
        [self parserFoundObjectEndForKey:tag];
    }

    #pragma mark -

    - (void)parserFoundObjectStartForKey:(NSString *)key{
        self.dict = [NSMutableDictionary new];
        [self.stack addObject:self.dict];
    }

    - (void)parserFoundObjectEndForKey:(NSString *)key{
        NSDictionary *dict = self.dict;

        //Remove the last value of stack
        [self.stack removeLastObject];

        //Load the previous object as dict
        self.dict = [self.stack lastObject];

        //The stack has contents, then we need to append objects
        if ([self.stack count]) {
            [self addObject:dict forKey:key];
        }else{
            //This is root object,wrap with key and assign output
            self.dict = (NSMutableDictionary *)[self wrapObject:dict withKey:key];
        }
    }

    #pragma mark - Add Objects after finding end tag

    - (void)addObject:(id)dict forKey:(NSString *)key{
        //If there is no value, bailout
        if (!dict) return;

        //Check if the dict already has a value for key array.
        NSMutableArray *array = self.dict[kArray];

        //If array key is not found look for another object with same key
        if (array) {
            //Array found add current object after wrapping with key
            NSDictionary *currentDict = [self wrapObject:dict withKey:key];
            [array addObject:currentDict];
        }else{
            id prevObj = self.dict[key];
            if (prevObj) {
                /*
                 There is a prev value for the same key. That means we need to wrap that object in a collection.
                 1. Remove the object from dictionary,
                 2. Wrap it with its key
                 3. Add the prev and current value to array
                 4. Save the array back to dict
                 */
                [self.dict removeObjectForKey:key];
                NSDictionary *prevDict = [self wrapObject:prevObj withKey:key];
                NSDictionary *currentDict = [self wrapObject:dict withKey:key];
                self.dict[kArray] = [@[prevDict,currentDict] mutableCopy];
            }else{
                //Simply add object to dict
                self.dict[key] = dict;
            }
        }
    }

    /*Wraps Object with a key for the serializer to generate txf tag*/
    - (NSDictionary *)wrapObject:(id)obj withKey:(NSString *)key{
        if (!key ||!obj) {
            return @{};
        }
        return @{key:obj};
    }

    @end

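EDIT 2:

A sample TXF file with more than 1000 entries.

Have you considered using pull-style reads and recursive processing? That would avoid reading the whole file into memory, and it would also remove the need to manage your own stack to keep track of how deep you are in the parse.

Below is an example in Swift. The example works with your sample "txf", but not with the Dropbox version, because some of your "members" span multiple lines. If that is a requirement, it can easily be implemented in the switch/case "$" section. However, I don't see your own code handling it either. Also, the example does not yet follow proper Swift error handling (the parse method would need an additional NSError parameter).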

    // Swift 1.x syntax (find, distance, advance, println).
    import Foundation

    extension String {
        public func indexOfCharacter(char: Character) -> Int? {
            if let idx = find(self, char) {
                return distance(self.startIndex, idx)
            }
            return nil
        }

        func substringToIndex(index:Int) -> String {
            return self.substringToIndex(advance(self.startIndex, index))
        }

        func substringFromIndex(index:Int) -> String {
            return self.substringFromIndex(advance(self.startIndex, index))
        }
    }

    // Reads lines until the end tag of the current object and returns its contents.
    func parse(aStreamReader:StreamReader, parentTagName:String) -> Dictionary<String,AnyObject> {
        var dict = Dictionary<String,AnyObject>()

        while let line = aStreamReader.nextLine() {
            let firstChar = first(line)
            let theRest = dropFirst(line)

            switch firstChar! {
            case "$":
                // Member: split at the first "=".
                if let idx = theRest.indexOfCharacter("=") {
                    let key = theRest.substringToIndex(idx)
                    let value = theRest.substringFromIndex(idx+1)
                    dict[key] = value
                } else {
                    println("no = sign")
                }
            case "#":
                // Nested object: recurse, then collect objects with the same tag in an array.
                let subDict = parse(aStreamReader,theRest)

                var list = dict[theRest] as? [Dictionary<String,AnyObject>]
                if list == nil {
                    dict[theRest] = [subDict]
                } else {
                    list!.append(subDict)
                    // Arrays are value types, so the modified copy must be stored back.
                    dict[theRest] = list!
                }
            case "/":
                if theRest != parentTagName {
                    println("mismatch... [\(theRest)] != [\(parentTagName)]")
                } else {
                    return dict
                }
            default:
                println("mismatch... [\(line)]")
            }
        }

        println("shouldn't be here...")
        return dict
    }

    var data : Dictionary<String,AnyObject>?

    if let aStreamReader = StreamReader(path: "/Users/taoufik/Desktop/QuickParser/QuickParser/file.txf") {
        if var line = aStreamReader.nextLine() {
            let tagName = line.substringFromIndex(advance(line.startIndex, 1))
            data = parse(aStreamReader, tagName)
        }
        aStreamReader.close()
    }

    println(JSON(data!))

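The StreamReader is borrowed from https://stackoverflow.com/a/24648951/95976

Edit

  • See the full code at https://github.com/tofi9/QuickParser
  • Pull-style line-by-line reading in Objective-C: How to read data from NSFileHandle line by line?

Edit 2

I rewrote the above in C++11 and got it to run in less than 0.05 seconds (release mode) on a 2012 MBA i5, using the updated file on Dropbox. I suspect that NSDictionary and NSArray carry some penalty. The code below can be compiled into an Objective-C project (the file needs to have the extension .mm):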

    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    using namespace std;

    class benchmark {
    private:
        typedef std::chrono::high_resolution_clock clock;
        typedef std::chrono::milliseconds milliseconds;

        clock::time_point start;

    public:
        benchmark(bool startCounting = true) {
            if (startCounting)
                start = clock::now();
        }

        void reset() {
            start = clock::now();
        }

        // Elapsed time in seconds, with millisecond resolution.
        double elapsed() {
            milliseconds ms = std::chrono::duration_cast<milliseconds>(clock::now() - start);
            double elapsed_secs = ms.count() / 1000.0;
            return elapsed_secs;
        }
    };

    struct obj {
        map<string, string> properties;
        map<string, vector<obj>> subObjects;
    };

    // Reads lines until the end tag of the current object and returns its contents.
    obj parse(ifstream& stream, const string& parentTagName) {
        obj result;
        string line;
        while (getline(stream, line)) {
            if (line.empty()) continue;  // substr(1) would throw on an empty line

            auto firstChar = line[0];
            auto rest = line.substr(1);

            switch (firstChar) {
                case '$': {
                    auto idx = rest.find_first_of('=');
                    if (idx == string::npos) {
                        ostringstream o;
                        o << "no = sign: " << line;
                        throw o.str();
                    }
                    auto key = rest.substr(0, idx);
                    auto value = rest.substr(idx + 1);
                    result.properties[key] = value;
                    break;
                }
                case '#': {
                    auto subObj = parse(stream, rest);
                    result.subObjects[rest].push_back(subObj);
                    break;
                }
                case '/':
                    if (rest != parentTagName) {
                        ostringstream o;
                        o << "mismatch end of object " << rest << " != " << parentTagName;
                        throw o.str();
                    }
                    return result;
                default: {
                    ostringstream o;
                    o << "mismatch line " << line;
                    throw o.str();
                }
            }
        }

        // Thrown as std::string so that the handler in main() catches it.
        throw string("I don't know why I'm here. Probably because the file is missing an end of object marker");
    }

    void visualise(const obj& node, int indent = 0) {
        for (auto& property : node.properties) {
            cout << string(indent, '\t') << property.first << " = " << property.second << endl;
        }

        for (auto& subObjects : node.subObjects) {
            for (auto& subObject : subObjects.second) {
                cout << string(indent, '\t') << subObjects.first << ": " << endl;
                visualise(subObject, indent + 1);
            }
        }
    }

    int main(int argc, const char * argv[]) {
        try {
            obj result;

            benchmark b;
            ifstream stream("/Users/taoufik/Desktop/QuickParser/QuickParser/Members.txf");
            string line;
            if (getline(stream, line)) {
                string tagName = line.substr(1);
                result = parse(stream, tagName);
            }

            // elapsed() returns seconds, not milliseconds.
            cout << "elapsed " << b.elapsed() << " s" << endl;

            visualise(result);
        } catch (const string& s) {
            cout << "error " << s << endl;
        }

        return 0;
    }

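Edit 3

See the link for the full C++ code: https://github.com/tofi9/TxfParser

I did some work on your GitHub source. With the following two changes I got an overall improvement of 30%, though the major improvement comes from "Optimization 1".

Optimization 1: based on your data, I came up with the following.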

    // Returns the index of the first occurrence of identifier, or -1 if it is not found.
    + (int)locate:(NSString*)inString check:(unichar) identifier {
        int ret = -1;
        for (int i = 0 ; i < inString.length; i++){
            if (identifier == [inString characterAtIndex:i]) {
                ret = i;
                break;
            }
        }
        return ret;
    }

    - (void)didFindKeyValuePair:(NSString *)tag{
    #if 0
        NSArray *components = [tag componentsSeparatedByString:@"="];
        NSString *key = [components firstObject];
        NSString *value = [components lastObject];
    #else
        int locate = [TXFParser locate:tag check:'='];

        // Without an '=', use the whole string for both, as the disabled branch does;
        // passing -1 to substringToIndex: would raise an exception.
        NSString *key = (locate < 0) ? tag : [tag substringToIndex:locate];
        NSString *value = (locate < 0) ? tag : [tag substringFromIndex:locate+1];
    #endif
        if (key.length) {
            self.dict[key] = value?:@"";
        }
    }
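
The idea behind Optimization 1 is to scan once for the first `=` and take two substrings, instead of splitting the member into an array of components. The same technique in C++ might look like the sketch below. `splitKeyValue` is a hypothetical name, and returning an empty value when there is no `=` is an assumption of this sketch.

```cpp
#include <string>
#include <utility>

// Splits "key=value" at the first '=' with one scan and two substring copies.
// ASSUMPTION: without an '=', the whole input is the key and the value is empty.
std::pair<std::string, std::string> splitKeyValue(const std::string& member) {
    const std::string::size_type eq = member.find('=');
    if (eq == std::string::npos) {
        return std::make_pair(member, std::string());
    }
    return std::make_pair(member.substr(0, eq), member.substr(eq + 1));
}
```

The two approaches are not equivalent when a value itself contains `=`. For `Formula=a=b`, splitting at the first `=` yields the value `a=b`, whereas taking the last element of `componentsSeparatedByString:@"="` yields only `b`.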

Optimization 2:

    - (id)objectFromString:(NSString *)txfString{
        [txfString enumerateLinesUsingBlock:^(NSString *string, BOOL *stop) {
    #if 0
            if ([string hasPrefix:@"#"]) {
                [self didStartParsingTag:[string substringFromIndex:1]];
            }else if([string hasPrefix:@"$"]){
                [self didFindKeyValuePair:[string substringFromIndex:1]];
            }else if([string hasPrefix:@"/"]){
                [self didEndParsingTag:[string substringFromIndex:1]];
            }else{
                //[self didFindBodyValue:string];
            }
    #else
            unichar identifier = ([string length]>0)?[string characterAtIndex:0]:0;
            if (identifier == '#') {
                [self didStartParsingTag:[string substringFromIndex:1]];
            }else if(identifier == '$'){
                [self didFindKeyValuePair:[string substringFromIndex:1]];
            }else if(identifier == '/'){
                [self didEndParsingTag:[string substringFromIndex:1]];
            }else{
                //[self didFindBodyValue:string];
            }
    #endif
        }];
        return self.dict;
    }

Hope it helps.